Preface



This volume contains the papers presented at the fifth workshop on Job 
Scheduling Strategies for Parallel Processing, which was held in conjunction with 
the IPPS/SPDP’99 conference in San Juan, Puerto Rico, on April 16, 1999. The 
papers have been through a complete refereeing process, with the full version 
being read and evaluated by five to seven members of the program committee. We 
would like to take this opportunity to thank the program committee, Andrea 
Arpaci-Dusseau, Stephen Booth, Allen Downey, Allan Gottlieb, Atsushi Hori, 
Phil Krueger, Richard Lagerstrom, Miron Livny, Virginia Lo, Reagan Moore, Bill 
Nitzberg, Uwe Schwiegelshohn, Ken Sevcik, Mark Squillante, and John Zahorjan, 
for an excellent job. Thanks are also due to the authors for their submissions, 
presentations, and final revisions for this volume. Finally, we would like to thank 
the MIT Laboratory for Computer Science and the Computer Science Institute 
at the Hebrew University for the use of their facilities in the preparation of these 
proceedings. 

This was the fifth annual workshop in this series, which reflects the continued 
interest in this field. The previous four were held in conjunction with IPPS’95 
through IPPS/SPDP’98. Their proceedings are available from Springer-Verlag
as volumes 949, 1162, 1291, and 1459 of the Lecture Notes in Computer Science 
series. 

Since our first workshop, parallel processing has evolved to the point where it 
is no longer synonymous with scientific computing on massively parallel super- 
computers. In fact, enterprise computing on one hand and metasystems on the 
other hand often overshadow the original uses of parallel processing. This shift 
has underscored the importance of job scheduling in multi-user parallel systems. 
Correspondingly, we had a session in the workshop devoted to job scheduling 
on standalone systems, emphasizing gang scheduling, and another on scheduling 
for meta-systems. A third session continued the trend from previous workshops 
of discussing evaluation methodology and workloads. 

An innovation this year was a panel discussion on the possible standardization 
of a workload benchmark that will serve for the evaluation of different schedulers. 
The panelists’ positions as well as much of the discussion have been written up
as a paper that appears in these proceedings. 



May 1999 



Dror Feitelson 
Larry Rudolph 
Scheduling for Parallel Supercomputing:

A Historical Perspective of Achievable Utilization 



James Patton Jones¹ and Bill Nitzberg¹

MRJ Technology Solutions 
NASA Ames Research Center, M/S 258-6 
Moffett Field, CA 94035-1000 

jjones@nas.nasa.gov 



Abstract. The NAS facility has operated parallel supercomputers for the past 11
years, including the Intel iPSC/860, Intel Paragon, Thinking Machines CM-5, 
IBM SP-2, and Cray Origin 2000. Across this wide variety of machine 
architectures, across a span of 10 years, across a large number of different 
users, and through thousands of minor configuration and policy changes, the 
utilization of these machines shows three general trends: (1) scheduling using a 
naive FCFS first-fit policy results in 40-60% utilization, (2) switching to the 
more sophisticated dynamic backfilling scheduling algorithm improves 
utilization by about 15 percentage points (yielding about 70% utilization), and 
(3) reducing the maximum allowable job size further increases utilization. Most 
surprising is the consistency of these trends. Over the lifetime of the NAS 
parallel systems, we made hundreds, perhaps thousands, of small changes to 
hardware, software, and policy, yet utilization was affected little. In particular, 
these results show that the goal of achieving near 100% utilization while 
supporting a real parallel supercomputing workload is unrealistic. 



1.0 Introduction 

The Numerical Aerospace Simulation (NAS) supercomputer facility, located at 
NASA Ames Research Center, serves in the role of pathfinder in high performance 
computing for NASA. In the late 1980s, we began exploring the use of highly parallel 
systems for supporting scientific and technical computing [1]. Today, it is commonly 
accepted that “supercomputing” is synonymous with “parallel supercomputing”. 

Supercomputing means running “big” jobs or applications which cannot be run on 
small or average-sized systems. Big, of course, is a relative term; we generally con- 
sider a job big if it is using at least half of the available resources of a big system. (We 
leave the definition of “big system” to the reader.) 



1. Work performed under NASA contract NAS2-14303, Moffett Field, CA 94035-1000 
Traditional vector supercomputers (e.g., the Cray C90) are capable of sustaining
nearly 100% utilization while supporting big jobs and running a varied workload [2]. 
Our experience has shown that this level of utilization is not attainable when running 
a supercomputing workload on a parallel supercomputer. 



2.0 The NAS Parallel Supercomputing Workload 

The NAS facility supports research and development in computational aerosciences. 
Hundreds of research projects are funded annually which use the parallel supercom- 
puters at NAS to perform high-end scientific and technical computing. Over the past 
11 years, the NAS parallel workload, priorities, and approach have been consistent.

The workload consists of a mix of: 

• hundreds of users; new users are constantly added 

• scientific and technical computing for aerospace applications 

• code development, debugging, scaling and performance analysis 

• “production” runs of existing applications 

At the same time, the NAS scheduling policy has consistently striven for (in order of 
priority): 

1. Overnight turn-around for big jobs, and 

2. Good machine utilization. 

The first priority supports “supercomputing”, the second supports efficient use of 
resources. NAS supports supercomputing by favoring supercomputer-sized jobs (big
ones, typically those that cannot run on any other system within NASA) over smaller 
jobs. In general, the system configuration, user allocations, and scheduling policies 
are tuned so that big jobs get overnight turn-around. 

In apparent conflict to the first priority is the second. Good machine utilization has 
historically meant 99% on traditional vector supercomputers, and the stakeholders 
(those whose money purchased the machines) have traditionally used utilization as a 
measure of success. As we show, parallel supercomputing does not achieve 99% utili- 
zation. It should be noted that machine utilization is arguably not the best measure of 
the “goodness” or value of a computing system. This issue is discussed further in sec- 
tion 5 below. System utilization is used as the basis of comparison in this paper prima- 
rily because utilization was the single largest continuous dataset available for the 
systems under discussion. 

The system configuration and the mechanisms by which we let users run jobs have also
been consistent throughout the past 11 years. The systems are all space-shared
(partitioned), and batch scheduled. Interactive use is permitted, but it must take place by
allocating resources via the batch system, then using those resources interactively. 
This approach to using parallel computers has prevailed, despite the availability of 
good time-sharing and gang-scheduling facilities on several systems, for two reasons:
the need for consistency of timings and efficiency of execution. Analysis of
algorithms, exact megaflop rates, and scaling are major components of the NAS workload.
Approaches other than strict, partitioned space sharing don’t support this. Further- 
more, systems such as the Intel Paragon and Cray Origin 2000 suffer from an interfer- 
ence problem (discussed below), in which it is possible for jobs to “overlap” in such a 
way as to slow each other down by far more than would be expected by the simple 
sharing of resources. 

Most of the applications run at NAS are statically balanced (applications which 
require a well balanced load across all nodes). Statically balanced applications strive 
to give an equal amount of work to each processor. A single slow process in a stati- 
cally load-balanced application can completely ruin the performance of the applica- 
tion, as other processes will have to wait for it. Another issue arises from message- 
passing synchronization. Even if we overlay parallel jobs to avoid load-balancing 
problems, tightly synchronized applications can incur an extra synchronization delay 
for messages because processes are not gang-scheduled (scheduled to run at the same
time across all their assigned nodes). These constraints are consistent across typical 
parallel supercomputing workloads. (For further discussion of parallel supercomput- 
ing workloads, see [12].) 



3.0 Supercomputer Resource Sharing 

In a supercomputer, the two resources most visible to the user are the CPU and the 
memory. In parallel supercomputers, these resources are generally grouped together 
as compute nodes. This paper focuses on node utilization. There are two methods of 
sharing CPUs in a large MPP system: time sharing and space sharing. Time sharing 
allows different programs to run on the same node simultaneously. The operating sys- 
tem is responsible for scheduling different programs to run, each for a certain time 
slice (quantum). Space sharing (also known as tiling) gives a parallel application 
exclusive access to a set of compute nodes on which to run. 

The five systems under review have a mixture of node sharing methods, as shown in 
Table 1. 





Parallel Systems   Intel      TMC    Intel      IBM    SGI
                   iPSC/860   CM-5   Paragon    SP-2   Origin2000

Gang Scheduling    -          ✓      unusable   -      -
Time Sharing       -          ✓      unusable   ✓      ✓
Space Sharing      ✓          ✓      ✓          ✓      ✓

Table 1. MPP Node Sharing Methods
With the exception of the iPSC/860, all these systems support timesharing.
Timesharing works well for serial jobs, which can fit into the memory of a single node. But
at NAS we are interested in running parallel applications that cannot run on a single 
node because of the resources they require (such as more memory than is available, or 
so much CPU that the single-node run time would be too long). It is often assumed 
that parallel jobs can be timeshared automatically by the operating system. In reality, 
however, the issues of load balancing and synchronization make timesharing unac- 
ceptable for the parallel applications in the NAS workload. 

The issues of timesharing the NAS workload on parallel systems were discussed in
detail several years ago [10]. The following excerpt is relevant here: 

Among the statically balanced applications is a very important class of tightly 
synchronized communication-intensive codes. These applications form a large 
part of the NAS work load, and tightly synchronized applications are common 
in other fields. The reason these applications cannot be efficiently timeshared 
is the accumulation of communication delays created by the uncoordinated 
scheduling of processes across the nodes. In tightly synchronized applications, 
information flows between nodes as the calculation progresses. Even when 
there is only nearest neighbor communication, information flows from neigh- 
bor to neighbor eventually reaching all nodes. Every time that information flow 
is disrupted the entire application slows down. 

One solution to this problem is to coordinate time sharing across the nodes of a paral- 
lel application, so that all processes of a given application run at the same time. This is 
called gang-scheduling or co-scheduling [9]. Gang-scheduling requires operating
system support, and a scheme for handling communication in progress (i.e., no messages
should be lost when processes are swapped) [3]. Although not necessary, it is easier to 
implement gang scheduling if nodes are grouped into fixed size partitions, though 
jobs then do not have flexibility in how many nodes they run on. Since all nodes in a 
partition run the same number of processes, the scheduler does not have to deal with 
unbalanced scheduling. Gang-scheduling in fixed-size partitions is an effective way to 
deal with the problems of timesharing. 

Unfortunately, only the Paragon and the CM-5 supported gang scheduling. The
Paragon’s implementation of gang scheduling added so much instability to the system that
it proved unusable. On the CM-5 we determined that the performance impact
outweighed any benefits gang scheduling would have provided.

The second scheduling method is space sharing. In this model, a parallel application 
is given exclusive access to a set of compute nodes on which to run. If any timeshar- 
ing occurs, it is with the mostly inactive UNIX daemons. The parallel computer can 
then be divided between different parallel applications, without one competing with 
another for resources. Space sharing, however, does increase the difficulty of batch 
job scheduling. Space sharing can also lower the utilization of other system
resources. Since jobs have exclusive control over the nodes, space sharing controls the
CPU together with other system resources, like memory, disks, and interconnect
bandwidth. If the application is not using those resources, they are wasted.

Given the vast proportion of the NAS workload that consists of message-passing
applications, space sharing has repeatedly been chosen for the NAS parallel
supercomputers in order to deliver the greatest per-application performance and consistent
runtimes.



4.0 Analysis of NAS Parallel Supercomputer Utilization 

In this section we describe the hardware configuration of five NAS parallel supercom- 
puters and discuss batch job scheduling and utilization for each. (For a discussion of 
batch job scheduling requirements for large parallel supercomputers, like those at 
NAS, see [10].) During the analysis, several trends begin to emerge. These are dis- 
cussed as they become apparent in the data presented. 

Figures 1-6 show the weekly available node utilization for each of the five NAS MPP 
systems under review, from installation (or when the system was stable enough to put 
users on) until decommission. By “available” node utilization, we mean that both 
scheduled and unscheduled outages have been taken into consideration when calculat- 
ing the percentage utilization. 
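
As a rough illustration of the "available" utilization metric described above, the sketch below divides the node-hours consumed by jobs by the node-hours that remained after subtracting outages. The function name, its parameters, and the sample numbers are assumptions made for illustration; the paper does not give the exact accounting formula.

```python
def available_utilization(busy_node_hours, total_nodes, hours_in_period, outage_node_hours):
    """Percent utilization of the node-hours that were actually available.

    Assumed definition: outages (scheduled or unscheduled) are removed from
    the denominator, so downtime is not counted as idle time.
    """
    available = total_nodes * hours_in_period - outage_node_hours
    if available <= 0:
        raise ValueError("no node-hours were available in this period")
    return 100.0 * busy_node_hours / available


# Hypothetical week on a 128-node system with 12 hours of full-system outage.
week = available_utilization(
    busy_node_hours=9984,
    total_nodes=128,
    hours_in_period=168,
    outage_node_hours=128 * 12,
)
```

With these made-up inputs the week comes out at 50 percent; counting the outage as idle time instead would report a lower figure for the same workload.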

4.1 Intel iPSC/860 (Jan. 1990 to Sept. 1994) 

The Intel iPSC/860 (also known as the Touchstone Gamma System) is a MIMD 
parallel computer. The system at NAS consisted of 128 compute nodes (each with a 
single 40 MHz i860 XR processor and eight megabytes of physical memory), ten I/O 
nodes (each with an i386 processor and four megabytes of memory), one service node 
(with a single i386 processor, four megabytes of memory, and an Ethernet interface),
and an i386-based PC front end with eight megabytes of memory. The compute nodes 
are connected via a wormhole-routed hypercube network, which delivers 2.8 
megabytes per second per link. 

The Network Queueing System (NQS, [8]) was used as the batch system, implement- 
ing queue-level “first-come first-serve first-fit” (FCFS-FF) scheduling with different 
size priorities during the day (i.e. big jobs had priority at night, small jobs during the 
day). The FCFS-First-Fit algorithm works as follows: batch jobs are evaluated in 
FCFS order in the queue, i.e. oldest job first. For each job, the batch system first 
checked if there were enough nodes available to run the job, and if so, then compared 
the job requirements (walltime and node count) to the current scheduling policy. If 
either of these two checks failed, the scheduler skipped to the next job. If both were 
successful, the scheduler ran the job and removed it from the list. This process contin- 
ued until all the jobs were evaluated. 
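
The FCFS-First-Fit pass described above can be sketched as follows. This is a minimal illustration, not the NQS implementation: the `Job` fields, the `policy_allows` callback, and the daytime limits in `daytime_policy` are all hypothetical stand-ins for the site policy.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    walltime_hours: float


def fcfs_first_fit(queue, free_nodes, policy_allows):
    """One scheduling pass: walk the queue oldest-first and start every job that fits.

    queue          -- jobs in submission order (oldest first)
    free_nodes     -- number of currently idle nodes
    policy_allows  -- hypothetical callback standing in for the site policy check
    Returns (started, still_waiting, free_nodes_left).
    """
    started, waiting = [], []
    for job in queue:
        # Check node availability first, then the scheduling policy.
        if job.nodes <= free_nodes and policy_allows(job):
            started.append(job)
            free_nodes -= job.nodes
        else:
            waiting.append(job)  # skipped; later jobs may still fit
    return started, waiting, free_nodes


def daytime_policy(job):
    # Assumed daytime limits (not from the paper): small, short jobs only.
    return job.nodes <= 64 and job.walltime_hours <= 2


queue = [Job("A", 128, 8), Job("B", 32, 1), Job("C", 64, 2), Job("D", 64, 1)]
started, waiting, left = fcfs_first_fit(queue, free_nodes=128, policy_allows=daytime_policy)
```

In this example the oldest job A is passed over by the daytime policy, B and C start, and D waits because only 32 nodes remain. Nothing in the pass reserves nodes for A.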
Scheduling on the iPSC/860 was relatively simple, as the system itself was very
inflexible. The architecture divided the system into “partitions” each of a power-of-2
number of nodes. Batch jobs were then run in the smallest possible partition size. This
made scheduling easier, but forced idle time when running medium-sized jobs. For
example, a 65-node job could run only in a 128-node partition.



[Plot of weekly node utilization (percent) with a 3-month average]

Fig. 1. iPSC/860 Utilization 

Since there was no timesharing available, this forced the remaining 63 nodes to be left 
idle. Furthermore, there existed a system limit of a maximum of ten concurrent parti- 
tions. This limit also had the potential for forcing idle time, even when there was a 
backlog of work. For example, if the iPSC/860 was running ten 2-node jobs, the 
remaining 108 nodes would be idle. But given the typical job size in the NAS work- 
load, the maximum partition limit was rarely exceeded. (The system ran out of nodes 
well before it allocated ten partitions.) 
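
The two sources of forced idle time described above (power-of-2 partitions and the ten-partition limit) can be checked with a short sketch. The function names are hypothetical, and the placement loop is a simplification that stops at the first job it cannot place.

```python
def partition_size(job_nodes):
    """Smallest power-of-2 partition that holds the job."""
    size = 1
    while size < job_nodes:
        size *= 2
    return size


def idle_nodes(job_sizes, total_nodes=128, max_partitions=10):
    """Nodes left without useful work after placing jobs in order.

    Counts both nodes wasted inside oversized partitions and nodes that
    cannot be allocated because the partition limit has been reached.
    """
    used_by_jobs = 0
    allocated = 0
    partitions = 0
    for nodes in job_sizes:
        size = partition_size(nodes)
        if partitions == max_partitions or allocated + size > total_nodes:
            break  # simplification: stop at the first job that cannot be placed
        allocated += size
        used_by_jobs += nodes
        partitions += 1
    return total_nodes - used_by_jobs
```

Calling `idle_nodes([65])` gives 63 and `idle_nodes([2] * 10)` gives 108, matching the two examples in the text; an eleventh 2-node job changes nothing because the partition limit has already been reached.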

The iPSC/860 was fairly unreliable during the first two years at NAS. The first year 
the system was thoroughly investigated by NAS staff, during which time a variety of 
benchmarks were developed and run on the system. Figure 1 shows the node
utilization starting in mid-1993. (Full accounting data for the first two years is unavailable.)
At the time, the utilization shown was considered an impressive improvement over 
that of previous years, and is primarily attributable to two factors. The first was a sig- 
nificant increase in system stability. Second, in early 1993, users had begun to shift 
from application debugging to running their codes as “production” batch jobs. Notice 
that the utilization ranged between 40 and 60 percent for most of the period shown. 
4.2 TMC CM-5 (Jan. 1993 to Mar. 1995)

The Thinking Machines Corporation (TMC) CM-5 is a MIMD parallel computer, 
although it retains some properties of its SIMD predecessor, the CM-2. Notably, each 
processing node of the CM-5 can be thought of as a small SIMD parallel computer, 
each with a master SPARC processor sequencing a 2 x 2 array of custom vector units. 
Furthermore, via a dedicated control network, the processing nodes can also operate 
in a synchronized SPMD fashion. 

The system at NAS consisted of 128 compute nodes (each with one 33 MHz SPARC 
processor, four vector units, and 32 megabytes of physical memory), four control 
nodes, one I/O node (which manages the attached RAID), and one system-manager 
node. Each of these nodes was a standard Sparc-2 workstation (with 64 megabytes of 
physical memory). The nodes are interconnected via a 4-ary fat-tree data network, 
which supplies 20 megabytes per second per link, and a bisection bandwidth of 655 
megabytes per second. 

The CM-5 was scheduled using the Distributed Job Manager (DJM) which also 
implemented a size-priority FCFS-FF algorithm that was time-of-day sensitive. Like
the iPSC/860, the CM-5 architecture restricted all partitions to a power-of-2 number 
of nodes. However, the CM-5 further restricted the partition size to a minimum of 32 
nodes, and the partition size could not be changed without a reboot of the entire sys- 
tem. During the day, the CM-5 was run with one 64-node partition and two 32-node 
partitions. Each night, the system was reconfigured into a single 128-node partition to 
allow large jobs to run. 
Fig. 2. CM-5 Utilization

The CM-5 followed quite a different life cycle than the iPSC/860. Initially only NAS
staff had access for benchmarking and evaluation purposes, and to work to stabilize 
the system. But rather than taking several years as with the iPSC/860, we had to wait 
only several weeks before putting “real” users on. Figure 2 shows how quickly the 
scientists put the CM-5 to work. Part of the reason for the short ramp-up was that 
most of the researchers were migrating to the CM-5 from the previous generation 
CM-200 (which had been previously upgraded from a CM-2) at NAS. Many of these 
users already had codes that ran well on this architecture. 

Like the iPSC/860, much of the increased usage of the CM-5 in its final year was due
to users completing the debugging cycle, and moving to running production codes. 
Halfway through the second year of the system’s stay at NAS, in an effort to increase 
the utilization of the machine, space sharing was relaxed on the small partitions dur- 
ing the day to allow two jobs to timeshare within a given partition. Doing so resulted 
in a gradual increase of utilization; however, it also resulted in a 20 percent slowdown 
in both timeshared applications. 

4.3 Intel Paragon XP/S-15 (Feb. 1993 to July 1995) 

The Intel Paragon XP/S-15 is a MIMD parallel computer. The system at NAS 
consisted of 208 compute nodes (each with two 50 MHz i860 XP processors and 32 
megabytes of physical memory), four service nodes (which make up the service 
partition and provide an interface to the outside world, serving as a “front end” to the 
system), eight disk I/O nodes, three HIPPI nodes, and four general-purpose nodes. 
The compute nodes are connected via a wormhole-routed 2D mesh network, which 
delivers 175 megabytes per second per link. The Paragon is the successor of the Delta 
machine. 

Using NQS, we implemented queue-level FCFS-FF scheduling with different size pri- 
orities, as on the iPSC/860. Scheduling the Paragon, however, was more difficult than 
scheduling the previous systems because power-of-2 job sizes were no longer 
required. The resulting wide variety of job sizes decreased the scheduling efficiency. 

The Paragon, like the CM-5, had a relatively short shake-out period before we started 
adding users onto the system. These were primarily users from the iPSC/860 who 
wanted to try out the new system. Once on, many chose to return to the iPSC/860 until 
the Paragon stabilized. 

The utilization shown in Figure 3 for the first half of 1993 is based on UNIX SAR 
(system activity report) and load average data. Some data for 1993 were lost (thus the 
apparent zero usage). Following this, the MACS accounting software was installed, 
enabling more accurate utilization tracking. This is also the time when the remaining 
iPSC/860 users began to migrate over to the Paragon in order to continue their work. 
(Compare the iPSC/860 and the Paragon utilization graphs in Figure 6 to see more 
clearly the drop in the older system corresponding to an increase in the newer
system.)

Fig. 3. Paragon Utilization

The periodic dips in the Paragon utilization correspond to testing and regular operat- 
ing system upgrades. During these times, though the system was available to users, 
many opted off the machine to avoid the frustrations of trying to use a system in flux. 
From the utilization graph, we see that the FCFS-FF algorithm maintained the
utilization between 40 and 60 percent for most of the “production” lifetime of the
system.

4.4 IBM SP-2 (July 1994 to Sept. 1997) 

The IBM SP-2 is a MIMD parallel computer. The system at NAS consisted of 160 
compute nodes. Each node is an IBM RS6000/590 workstation powered with a single 
66.7 MHz POWER2 processor and at least 128 megabytes of physical memory. (Six 
of the nodes had 512 megabytes of memory). The nodes of an SP-2 are connected by 
a packet-switched, multi-stage omega network (a hierarchy of crossbar switches) 
utilizing buffered wormhole-routing. The switch can deliver 40 megabytes per second 
bidirectionally. 

The system arrived with IBM’s LoadLeveler batch system, which provided simple
FCFS scheduling. Based on what we had learned with previous parallel systems, we
predicted that a FCFS-FF scheduling algorithm would result in a system node
utilization of around 50 percent. However, the LoadLeveler batch system used a simple
FCFS algorithm which was achieving roughly 25 percent utilization. After six
months, we replaced LoadLeveler with the Portable Batch System (PBS) utilizing a
FCFS-FF algorithm [5]. System utilization immediately doubled (see Figure 4),
averaging 50 percent. This level of utilization continued the entire time we used the
FCFS-FF scheduler.




Fig. 4. SP-2 Utilization 



One problem with the First-Fit algorithm is that it can continually take away nodes 
from a large waiting job in order to run smaller jobs. As a result, large jobs tend to 
“starve” in the queue, always waiting for resources to become available, but never 
receiving them. One attempt to remedy this situation is to periodically “drain” the sys- 
tem in order to free up enough nodes to be able to run the large waiting jobs. How- 
ever, draining the system is an expensive operation since a large number of nodes may 
need to be kept idle to ensure that a particular job can run. For example, suppose we 
want to run a 128-node job, but there are only 127 nodes available, and there is a 5- 
hour job running on the last node. We have to keep 127 nodes idle for 5 hours to guar- 
antee that this job will run. While this simplistic approach works, it is obvious that it 
does not lead to the best system utilization possible. 
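
The cost of draining in the example above is simple arithmetic, sketched here with a hypothetical helper. The 128-node machine size in the last line is taken from the example, not from any particular NAS system configuration.

```python
def drain_cost_node_hours(idle_nodes, hours_until_last_job_ends):
    """Node-hours left unused while the system drains for a large job."""
    return idle_nodes * hours_until_last_job_ends


cost = drain_cost_node_hours(idle_nodes=127, hours_until_last_job_ends=5)
lost_fraction = cost / (128 * 5)  # share of the machine's capacity over those 5 hours
```

Here `cost` is 635 node-hours, which is more than 99 percent of what the machine could have delivered during the 5-hour wait.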



A better solution is to have the scheduler recognize that we will not be able to run our
job for 5 hours, but we can use the 127 nodes to run any jobs that can complete in 5 
hours, using a backfilling method. Static-Backfilling fixes the starting time of the 
high-priority job at the earliest time possible (i.e., the earliest time when the required 
nodes will be available). In our previous example, this was 5 hours. A Dynamic-Back- 
filling (DBF) algorithm, rather than fixing the time at the earliest time that it can, will 
instead determine the most appropriate time to start the job within a starting window 
[4, 11]. The earliest time possible may in some cases not lead to the best result. Using
our previous example, let’s assume that we also had a 125-node job (for 5 hours 30
minutes) queued. Using static-backfilling, we could not run this job as we only have 5
hours to backfill. But with dynamic-backfilling, we should recognize that by shifting
the starting time by 30 minutes, we will be able to fill 125 nodes, significantly 
increasing resource utilization. Thus the DBF algorithm attempts continually to bal- 
ance both the need to run high-priority jobs and the need to maintain as many nodes 
as possible in-use, by providing the ability to drain the system efficiently and to 
reserve nodes for the top-ranked job. 
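
The difference between the two backfilling variants can be illustrated with a toy model of the example above. This is only a sketch under stated assumptions, not the PBS DBF module: the function names, the half-hour search step, the `max_delay` window, the extra 2-node job, and the objective (maximize backfilled node-hours, with ties going to the earlier start) are all hypothetical.

```python
def static_backfill(candidates, free_nodes, window_hours):
    """Greedy pick of jobs that fit in the idle nodes and end before the reserved start.

    candidates -- list of (name, nodes, hours) in queue order
    """
    chosen = []
    for name, nodes, hours in candidates:
        if nodes <= free_nodes and hours <= window_hours:
            chosen.append(name)
            free_nodes -= nodes
    return chosen


def dynamic_backfill(candidates, free_nodes, earliest_start, max_delay, step=0.5):
    """Try start times from earliest_start to earliest_start + max_delay and
    keep the one whose backfilled jobs use the most node-hours.

    Ties go to the earlier start, so the big job is delayed only when it pays off.
    """
    best_start, best_jobs, best_work = earliest_start, [], -1.0
    delay = 0.0
    while delay <= max_delay:
        window = earliest_start + delay
        jobs = static_backfill(candidates, free_nodes, window)
        work = sum(n * h for name, n, h in candidates if name in jobs)
        if work > best_work:
            best_start, best_jobs, best_work = window, jobs, work
        delay += step
    return best_start, best_jobs, best_work


queued = [("big-125", 125, 5.5), ("small-2", 2, 4.0)]
static_jobs = static_backfill(queued, free_nodes=127, window_hours=5.0)
start, dynamic_jobs, node_hours = dynamic_backfill(
    queued, free_nodes=127, earliest_start=5.0, max_delay=1.0
)
```

With a fixed 5-hour window only the 2-node job is backfilled. Allowing the reserved start to slip to 5.5 hours lets the 125-node job run as well, so almost all of the 127 idle nodes do useful work while the 128-node job waits.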

As soon as we installed PBS, we began implementing the DBF algorithm. Approxi- 
mately two months later, the PBS FCFS-FF scheduler module was replaced with our 
PBS DBF scheduler module. (This DBF scheduling module is now included in the 
PBS distribution.) System utilization again jumped, this time roughly 20 percentage 
points, to 70 percent. Over time, as users began to run fewer debugging jobs and 
started scaling up the problem size of their applications, the utilization slowly crept up 
to 80 percent. The system continued at this level of usage until it was allocated to the 
NASA Metacenter project. (Given that the Metacenter project introduced many new
variables, thereby substantially changing the userbase, scheduling model and
workload flow of the system, we do not report those data here. Further discussion of
the Metacenter project and its meta-scheduling experiments is included in [6, 7].)

4.5 Cray Origin2000 (Jan. 1997 - )

The SGI/Cray Origin2000 is a MIMD computer based on the cache-coherent
non-uniform memory architecture (ccNUMA) providing a distributed shared memory (DSM)
system. Each node contains two MIPS RISC 64-bit R10000 processors and a
configurable amount of memory; nodes are connected via a modified hypercube network.

In January 1997, NAS received its first 32-processor SGI Origin2000. (Systems larger 
than 32 processors receive the “Cray” label.) One of the most useful features of the
Origin2000 is the single-system image. Users can utilize the system as a large sym- 
metrical multiprocessor (SMP) rather than having to be concerned with the distributed 
nature of the architecture. Therefore when the system first arrived, we decided to see 
if we could schedule it like a true timeshared SMP. We installed PBS with a FCFS-FF 
SMP scheduling module that we had been running on our cluster of four Cray J90s. 

In spite of its single system image, the attempt to schedule this distributed memory 
system quickly proved problematic, as both the underlying architecture and the inter- 
ference between jobs resulted in severe performance degradation and varied runtimes. 
Specifically, since the hardware components are distributed across the interconnected 
hypercubes, the number of network routers between any two nodes within the system 
increases with the distance between those nodes. This alone translates into a variable 
latency in communication for message passing applications. Since applications could 
be started on any set of nodes within the system, the runtimes of a given application 
varied from run to run. 
In addition, the operating system attempts to start applications on contiguous nodes,
but as the system “fills up” with work the nodes quickly become fragmented since 
every application has a different runtime. As a result, applications are started on nodes 
that are scattered throughout the system. As this tends to increase the distance 
between the nodes assigned to a given job, latency again increases, as do runtimes. 

The third problem we identified was in the “sharing” of nodes between jobs.
Remember that an Origin2000 “node” has two processors which have equal access to locally
shared memory. Scheduling multiple applications onto the same node makes it possi- 
ble for these processes to compete for the memory on that node, delaying message- 
passing and thereby further increasing the runtime of the application as a whole. 
While many sites run applications which are tolerant of these conditions, the applica- 
tions run at NAS displayed a range of variation in runtimes, from 30% on a lightly 
loaded system up to 300% on a fully loaded system. 

Needless to say, we quickly turned to another solution. We switched to software parti- 
tioning of the system, where we assigned sets of processors to specific partitions. 
Each partition was assigned an SGI NODEMASK which identified which nodes 
belonged to that partition. Batch jobs were then scheduled into the smallest possible 
partition. (A NODEMASK is similar to a “processor set”, except it is node-based 
rather than CPU-based. While not a perfect solution, the NODEMASK capability 
proved quite functional, even though it was only advisory to the kernel. Some of this 
functionality will be made available in the SGI MISER kernel scheduler.) 




Fig. 5. Origin2000 Utilization

In March 1997, we doubled the size of our Origin2000 system, creating a
64-processor parallel supercomputer running under a single system image. Figure 5 shows the
utilization starting with the installation of this system.

We scheduled each of the partitions within the Origin2000 as we did on the previous
parallel systems. Figure 5 shows the system utilization that resulted: a rough average
of 35 percent. From our previous experience, we predicted that adding dynamic
backfilling to the scheduling algorithm would add another 15 percentage points to the
utilization. Our prediction was again borne out: average system utilization increased
from about 35 percent to about 55 percent. Nearly a year later NAS purchased a
second 64-processor Origin2000 system. Rather than being run as a separate system, it
was configured to share the workload of the first system.

Another trend we have noticed is that if you change the definition of “big jobs” in 
relation to the maximum number of nodes available, utilization will increase. We pre- 
dicted that by adding this second 64-processor system to the compute pool (thereby 
doubling the number of processors available for computation) but maintaining the 
maximum job size at 64-processors, utilization should increase. We were surprised at 
how smoothly the utilization curve in Figure 5 spanned the doubling of the resources 
without increasing the maximum job size. This appears to be in part a result of the 
second system being identical to the first. The amount of resources doubled, and the 
users responded by submitting twice as many jobs as before. Turnaround time and uti-
lization remained constant, but throughput doubled.




Fig. 6. Comparison of Parallel Supercomputer Utilization 
Figure 5 ends at the point of installation of two additional 128-processor Origin2000
systems. Given that the full user base was not given immediate access, there are insuf-
ficient data to make a fair comparison between these new systems and those already
reported. Then, in mid-November, the two 128-processor systems were merged into
the first 256-processor Origin2000, and the full NAS parallel system user base was
given access. Accounting, scheduling policy, and the scheduler itself have been con-
tinually changing since that time. Not until the system changes are stabilized will we
be able to compare the 256-processor Origin2000 to the other systems reviewed herein.

In order to make the comparison of the various system utilization graphs easier, a 
composite graph is included above. Figure 6 shows the lifetime utilization of each of 
the five MPP systems, along a single timeline. It is useful to compare the arrival and 
decommission of the systems with each other. From such a composite graph, it is eas- 
ier to see that the iPSC/860 utilization began dropping as the Paragon started rising. 



5.0 Caveats and Conclusions 

First, a caveat. There are hundreds of variables which contribute to effective schedul- 
ing of a parallel supercomputer. We have ignored all but one — the scheduling algo- 
rithm — in our analysis. Further, we make some broad, sweeping generalizations 
which are heavily dependent on the user workload of the system. Although experi- 
ence has shown that the NAS facility is a prototypical supercomputing center, one 
should not totally discount the possibility that the unique features of this facility con- 
tribute to these results. 

Data gathered over 11 years of operating parallel supercomputers (including the Intel
iPSC/860, Intel Paragon, Thinking Machines CM-5, IBM SP-2, and Cray Origin 
2000) show three distinct trends: 

• scheduling using a naive FCFS first-fit policy results in 40-60% utiliza- 
tion, 

• switching to the more sophisticated dynamic-backfilling scheduling algo- 
rithm improves utilization by about 15 percentage points (yielding about 
70% utilization), and 

• reducing the maximum allowable job size increases utilization. 

Most surprising is the consistency of these trends. Over the lifetime of the NAS paral- 
lel systems, we made hundreds, perhaps thousands, of small changes to hardware, 
software, and policy. Yet, utilization was affected little, in general increasing slowly 
over the lifetime of the system, with the few significant increases attributable to 
improvements in the scheduling algorithms. In particular, these results show that the 
goal of achieving 100% utilization while supporting a real parallel supercomputing 
workload is currently unrealistic. The utilization trends are similar irrespective of 
what system is used, who the users are, and what method is used for partitioning
resources.

6.0 Future Work

The “goodness” of a scheduling algorithm and policy is hard to quantify. Although 
utilization is the most commonly used metric, it is still inadequate. Utilization does 
not take into account properties of the workload which, even if the jobs were opti- 
mally scheduled, would not yield 100% utilization. Such workload parameters as job- 
size mix, arrival rate of jobs, efficiency of applications, etc., are ignored. A better met- 
ric is needed. The utilization we report is simply the amount of time processors are 
assigned out of the total time processors are available (i.e., we ignore only system 
down time). We would like to refine utilization to be based not on the total uptime of a 
system, but on the optimal scheduling of the given workload. Other metrics we would 
like to explore are throughput measures and comparison of “time-to-solution”. 
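As a rough illustration of the utilization figure defined above (processor time assigned divided by processor time available, excluding only system down time), consider the following sketch. The record layout `(start, end, processors)` and the function name are assumptions made for this example; it further assumes that no job is counted as assigned while the system is down.

```python
def utilization(jobs, total_processors, period_start, period_end, downtime=0.0):
    """Fraction of available processor time that was assigned to jobs.

    jobs: iterable of (start, end, processors) tuples, times in seconds.
    downtime: seconds within the period during which the system was down.
    """
    assigned = 0.0
    for start, end, procs in jobs:
        # Count only the part of the job that falls inside the period.
        overlap = min(end, period_end) - max(start, period_start)
        if overlap > 0:
            assigned += overlap * procs
    available = (period_end - period_start - downtime) * total_processors
    return assigned / available


# Two hypothetical jobs on a 64-processor system over 100 seconds:
# (50 * 32 + 80 * 16) / (100 * 64) = 2880 / 6400
print(utilization([(0, 50, 32), (20, 100, 16)], 64, 0, 100))  # 0.45
```

Note that this measure says nothing about how well the assigned processors were used by the applications, which is one reason the text calls it inadequate.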

Seven years of job accounting data for five different parallel supercomputers is a lot 
of data; we have only scratched the surface. We would like to analyze big-job turn- 
around times in a fashion similar to our analysis of utilization trends. Further, we 
would like to investigate correlations between system stability (crashes), user load, 
turnaround time, workload characteristics, utilization, and, if possible, system culture 
(e.g., night time vs. day time, conference deadlines, etc.). 
Abstract. In this paper we suggest a strategy to design job schedul-
ing systems. To this end, we first split a scheduling system into three
components: Scheduling policy, objective function and scheduling algo- 
rithm. After discussing the relationship between those components we 
explain our strategy with the help of a simple example. The main focus 
of this example is the selection and the evaluation of several scheduling 
algorithms. 



1 Introduction 

Job scheduling for processors is a complex task. This is especially true for mas- 
sively parallel processors (MPPs) where many users with a multitude of different 
jobs share a large amount of system resources. While job scheduling does not af- 
fect the results of a job, it may have a significant influence on the efficiency of the 
system. For instance, a good job scheduling system may reduce the number of 
MPP nodes that are required to process a certain amount of jobs within a given 
time frame or it may permit more users or jobs to use the resources of a machine. 
Therefore, the job scheduling system is an important part in the management 
of computer resources which frequently represent a significant investment for a 
company or institution in the case of MPPs. 

Hence, the availability of a good job scheduling system is in the interest of 
the owner or administrator of an MPP. It is therefore not surprising that in the 
past new job scheduling methods have been frequently introduced by institutions 
which were among the first owners of MPPs like, for instance, ANL [10], CTC [11] 
or NASA Ames [12]. On the other hand, some machine manufacturers showed 
only limited interest in this issue as they frequently seem to have the opinion 
that “machines are not sold because of superior job schedulers”. Moreover, the
design of a job scheduling system must be based on the specific environment of 
a parallel system as we will argue in this paper. Consequently, administrators of 
MPPs will remain involved in the design of job scheduling systems in the
future. Methods or at least guidelines for the selection and evaluation of such a 
system would therefore be beneficial. It is the goal of this paper to make a first
step in this direction.

We start this paper by taking a close look at job scheduling systems. For us 
such a system is divided into 3 components: Scheduling policy, objective function
and scheduling algorithm. We discuss those components and the dependences 
between them. In the second part of the paper we use a simple example to 
describe the process of scheduling algorithm selection and evaluation. We do 
not believe that there is a single scheduling algorithm that suits all systems. 
Therefore, it is not the purpose of this example to show the superiority of any 
particular algorithm but to illustrate a method for the design of scheduling 
systems. 



2 Scheduling Systems 

The scheduling system of a multiprocessor receives a stream of job submission 
data and produces a valid schedule. We use the term ‘stream’ to indicate that 
submission data for different jobs need not arrive at the same time. Also the 
arrival of any specific data is not necessarily predictable, that is, the scheduling 
system may not be aware of any data arriving in the future. Therefore, the 
scheduling system must deal with a so called ‘on-line’ behavior. 

Further, we do not specify the amount and the type of job submission data. 
Different scheduling systems may accept or require different sets of submission 
data. For us submission data comprise all data which are used to determine a 
schedule. However, a few different categories can be distinguished: 

— User Data: These data may be used to determine job priorities. For in- 
stance, the jobs of some user may receive faster service at a specific location 
while other jobs are only accepted if sufficient resources are available. 

— Resource Requests: These data specify the resources which are requested 
for a job. Often they include the number and the type of processors, the 
amount of memory as well as some specific hardware and software require- 
ments. Some of these data may be estimates, like the execution time of a 
job, or describe a range of acceptable values, like the number of processors 
for a malleable job. 

— Scheduling Objectives: These data may help the scheduling system to 
generate ‘good’ schedules. For instance, a user may state that she needs the 
result by 8am the next morning while an earlier job completion will be of no
benefit to her. Other users may be willing to pay more if they obtain their 
results within the next hour. 
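The three categories of submission data listed above could be represented by a record such as the following. This is merely a sketch: all field names are hypothetical, and a real scheduling system may accept or require quite different data.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class JobSubmission:
    # User data: may be used to determine job priorities.
    user: str
    group: str
    # Resource requests: (min, max) processors; equal values mean a rigid job.
    processors: Tuple[int, int]
    memory_mb: int
    estimated_runtime_s: float  # an estimate, not a guaranteed value
    # Scheduling objectives: optional deadline in seconds after submission.
    deadline_s: Optional[float] = None

    def is_malleable(self) -> bool:
        """A job is malleable if it accepts a range of processor counts."""
        return self.processors[0] < self.processors[1]


job = JobSubmission("alice", "drug-design", (16, 64), 2048, 3600.0)
print(job.is_malleable())  # True
```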

Of course other submission data are possible as well. Job submission data 
are entered by the user and typically provided to the system when a job is 
submitted for execution. However, some systems may also allow reservation of 
resources before the actual job submission. Such a feature is especially beneficial 
for multi-site metacomputing [17]. In addition, technical data are often required 
to start a job, like the name and the location of the input data files of the job. 
But as these data do not affect the schedule if they are correct, we ignore them
here. Finally note that the submission of erroneous or incorrect data is also 
possible. However, in this case a job may be immediately rejected or fail to run. 

Now, we take a closer look at the schedule. A schedule is an allocation of 
system resources to individual jobs for certain time periods. Therefore, a schedule 
can be described by providing all the time instances where a change of resource 
allocation occurs as long as either this change is initiated by the scheduling 
system or the scheduling system is notified of this change. To illustrate this 
restriction assume a job being executed on a processor that is also busy with some 
operating system tasks. Here, we do not consider changes of resource allocation 
which are due to the context switches between OS tasks and the application. 
Those changes are managed by the system software without any involvement of 
our scheduling system. 

For a schedule to be valid some restrictions of the hardware and the system 
software must be observed. For instance, a parallel processor system may not 
support gang scheduling or require that at most one application is active on a 
specific processor at any time. Therefore, the validity constraints of a schedule 
are defined by the target machine. We assume that a scheduling system does 
not attempt to produce an invalid schedule. However, note that the validity of a 
schedule is not affected by the properties of a submitted job as those properties 
are not guaranteed to comply with submission data. For instance, if not enough 
memory is requested from and assigned to a job, the job will simply fail to run. 
But this does not mean that the resulting schedule is invalid. Also, a schedule 
depends upon other influences which cannot be controlled by the scheduling 
system, like the sudden failure of a hardware component. Therefore, the final 
schedule is only available after the execution of all jobs. 
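To make the notion of a valid schedule more concrete, the following sketch checks one of the validity constraints mentioned above, namely that at most one application is active on a specific processor at any time. The representation of a schedule as a list of `(job_id, processors, start, end)` tuples is an assumption made for this example; other target machines impose other constraints.

```python
from collections import defaultdict


def has_exclusive_processors(allocations):
    """Check that no processor is assigned to two jobs at the same time.

    allocations: list of (job_id, processors, start, end) tuples, where
    processors is an iterable of processor ids and [start, end) is the
    time period of the allocation.
    """
    by_processor = defaultdict(list)
    for _job_id, processors, start, end in allocations:
        for p in processors:
            by_processor[p].append((start, end))
    for intervals in by_processor.values():
        intervals.sort()
        # After sorting by start time, any overlap shows up between neighbors.
        for (_s1, e1), (s2, _e2) in zip(intervals, intervals[1:]):
            if s2 < e1:
                return False
    return True


valid = [("a", [0, 1], 0, 10), ("b", [1, 2], 10, 20)]
invalid = [("a", [0, 1], 0, 10), ("b", [1], 5, 8)]
print(has_exclusive_processors(valid), has_exclusive_processors(invalid))  # True False
```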

Next, the scheduling system is divided into 3 parts: 

1. A scheduling policy, 

2. an objective function and 

3. a scheduling algorithm. 

In the rest of this section we first describe these parts separately. Then the 
dependences between them are discussed. Finally, we compare the evaluation of 
scheduling systems with the evaluation of computer architectures. 

2.1 Scheduling Policy 

The scheduling policy forms the top level of a scheduling system. It is defined by 
the owner or administrator of a machine. In general, the scheduling policy is a
collection of rules to determine the resource allocation if not enough resources are 
available to satisfy all requests immediately. To better illustrate our approach, 
we give an example: 

Example 1. The department of chemistry at University A has bought a paral- 
lel computer which was financed to a large part by the drug design lab. The 
department establishes the following rules for the use of the machine: 
1. All jobs from the drug design lab have the highest priority and must be
executed as soon as possible. 

2. 100 GB of secondary storage is reserved for data from the drug design lab. 

3. Applications from the whole university are accepted but the labs of the 
chemistry department have preferred access. 

4. Some computation time is sold to cooperation partners from the chemical 
industry in order to pay for machine maintenance and software upgrades. 

5. Some computation time is also made available to the theoretical chemistry 
lab course during their scheduled hours. 

Note that these rules are hardly detailed enough to generate a schedule. But 
they allow a fuzzy distinction between good and bad schedules. Also, there may be 
some additional general rules which are not explicitly mentioned, like ‘Complete 
all applications as soon as possible if this does not contradict any other rule’. 
Finally, some conflicts between those rules may occur and must be resolved. For 
instance in Example 1, some jobs from the drug design lab may compete with 
the theoretical chemistry lab course. Hence, in our view a good scheduling policy 
has the following two properties: 

1. It contains rules to resolve conflicts between other rules if those conflicts 
may occur. 

2. It can be implemented. 

We believe that there is no general method to derive a scheduling policy. Also 
there is no need to provide a very detailed policy with clearly defined quotas. 
In many cases this will result in a reduction of the number of good schedules. 
For instance, it would not be helpful at this point to demand that 5% of the 
computation time is sold to the chemical industry in Example 1. If there are only 
a few jobs from the drug design lab then the department would be able to earn 
more money by defining a higher industry quota. Otherwise, the department 
must decide whether to obtain other funding for the machine maintenance or to 
reduce the priority of some jobs of the drug design lab. This issue will be further 
discussed in Section 2.4. 

2.2 Objective Function 

As stated in the previous section the owner of a machine will be able to determine 
whether any given schedule is good or bad. However, it is the goal of a scheduling 
system to consistently produce schedules which are as good as possible. This 
leads to two problems: 

1. It must be demonstrated that the scheduling system will always produce 
good schedules. 

2. It is necessary to provide a ranking among good schedules. 

Problems of the first kind are addressed in theoretical computer science by 
the concept of competitive analysis, see [18]. Unfortunately, this approach is not 
applicable for our scheduling systems for the following reasons: 
— Often competitive analysis cannot be successfully applied to methods which
are based on very complex algorithms or which use specific input data sets.

— Competitive factors are worst case factors that frequently are not acceptable
in practice. For instance, a competitive factor of 2 for the machine load of a 
schedule denotes that in some cases 50% of the resources are not used. On 
the other hand, those worst case input data typically do not occur in real 
installations. Frequently, this is also true if randomization is used for the 
analysis. 

Alternatively, a scheduling system can be applied to a multitude of different 
streams of submission data and the resulting schedules can be evaluated. This 
requires a method to automatically determine the quality of a schedule. There- 
fore, an objective function must be defined that assigns a scalar value, the so 
called schedule cost, to each schedule. Note that this property is essential for the 
mechanical evaluation and ranking of a schedule. In the simplest case all good 
schedules are mapped to 0 while all bad schedules obtain the value 1. Most likely 
however, this kind of objective function will be of little help. To derive a suitable 
objective function an approach based on multi criteria optimization can be used, 
see e.g. [20]: 

1. For a typical set of jobs determine the Pareto-optimal schedules based on 
the scheduling policy. 

2. Define a partial order of these schedules. 

3. Derive an objective function that generates this order. 

4. Repeat this process for other sets of jobs and refine the objective function 
accordingly. 

To illustrate Steps 1 and 2 of our approach we consider Rules 1 and 5 of 
Example 1. Assume that both rules are conflicting for the chosen set of job 
submission data. Therefore, we determine a variety of different schedules, see 
Fig. 1. Note that we are not biased toward any specific algorithm in this step. 
We are primarily interested in those schedules which are good with respect to at 
least one criterion. Therefore, at first all Pareto-optimal schedules are selected. 
Those schedules are indicated by bullets in Fig. 1. Next, a partial order of the 
Pareto-optimal schedules is obtained by applying additional conflict resolving 
rules or by asking the owner. In the example of Fig. 1 numbers 0, 1 and 2 have 
been assigned to the Pareto-optimal schedules in order to indicate the desired 
partial order. Here any schedule 1 is superior to any schedule 0 and inferior to 
any schedule 2 while the order among all schedules 1 does not matter. 
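The selection of Pareto-optimal schedules in Step 1 can be sketched for two criteria as follows. The sketch assumes that each candidate schedule has already been reduced to one cost value per rule and that smaller values are better; the schedule names and numbers are invented for illustration. The subsequent ranking (Step 2) still has to come from conflict resolving rules or from the owner.

```python
def pareto_optimal(schedules):
    """Return the names of all schedules not dominated by another schedule.

    schedules: dict mapping a schedule name to a tuple of criterion
    values, all of which are to be minimized.
    """
    def dominates(a, b):
        # a dominates b if it is no worse everywhere and better somewhere.
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))

    return sorted(
        name for name, cost in schedules.items()
        if not any(dominates(other, cost) for other in schedules.values())
    )


# Hypothetical (criterion for Rule 1, criterion for Rule 5) values.
candidates = {"s1": (1, 9), "s2": (3, 4), "s3": (4, 5), "s4": (8, 1), "s5": (8, 2)}
print(pareto_optimal(candidates))  # ['s1', 's2', 's4']
```

Here s3 is dominated by s2 and s5 by s4, so only s1, s2 and s4 remain as candidates for the partial order.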

The approach is based on the availability of a few typical sets of job data. 
Further, it is assumed that each rule of the scheduling policy is associated with a
single criterion function, like Rule 4 of Example 1 with the function ‘amount of
computation time allocated to jobs from the cooperation partners from industry’. 
If this is not the case, complex rules must be split. 

Now, it is possible to compare different schedules if the same objective func-
tion and the same set of jobs are used. Further, there are a few additional aspects
which are also noteworthy: 
Fig. 1. Pareto Example for 2 Rules



— Schedules can be compared even if they do not have the same target archi- 
tecture. 

— An up front calculation of the schedule cost may not be possible. Most likely, 
it will be necessary to execute all jobs first before the correct schedule cost 
can be determined. 

— No specification of job input data is given. Therefore, it is possible to compare 
two schedules which are based on the same set of jobs but use different job 
submission data. 

Thus, schedules can even be used as a criterion for system selection if desired. 
At many installations of large parallel machines simple objective functions 
are used, like the job throughput, the average job response time, the average 
slowdown of a job or the machine utilization, see [3]. We believe that it cannot 
be decided whether those objective functions are suitable in general. For some 
scheduling policy they may be the perfect choice while they should not be used 
for another set of rules. Also, it is not clear whether the use of those ‘simple’ 
objective functions allows an easier design of scheduling systems. 



2.3 Scheduling Algorithm 

The scheduling algorithm is the last component of a scheduling system. It has 
the task to generate a valid schedule for the actual stream of submission data 
in an on-line fashion. A good scheduling algorithm is expected to produce very 
good if not optimal schedules with respect to the objective function while not
taking ‘too much’ time and ‘too many’ resources to determine the schedule.

Fig. 2. On-line versus Off-line Dependence (axis label: Average Response Time)

Unfortunately, most scheduling problems are computationally very hard. 
This is even true for off-line problems with simple objective functions and few 
additional requirements, see for instance [7]. Therefore, it is not reasonable to 
hope in general for an algorithm that always guarantees the best possible sched- 
ule. In addition, some job data may not be immediately available or may be 
incorrect which makes the task for the algorithm even harder, see Section 2. 

In order to obtain good schedules the administrator of a parallel machine 
is therefore faced with the problem of picking an appropriate algorithm among
a variety of suboptimal ones. She may even decide to design an entirely new 
method if the available ones do not yield satisfactory results. The selection of 
the algorithm is highly dependent on a variety of constraints: 



— Schedule restrictions given by the system, like the availability of dynamic 
partitioning or gang scheduling. 

— System parameters, like I/O ports, node memory and processor types. 

— Distribution of job parameters, like the number of large or small jobs.

— Availability and accuracy of job information for the generation of the sched- 
ule. For instance, this may include job execution time as a function of allo- 
cated processors. 

— Definition of the objective function. 






Jochen Krallmann et al. 



Frequently, the system administrator will simply take scheduling algorithms 
from the literature and modify them to her needs. Then, she picks the best of her 
algorithm candidates. After making sure that her algorithm of choice actually
generates valid schedules, she also must decide whether it makes sense to look for
a better algorithm. Therefore, it is necessary to evaluate those algorithms. We 
distinguish the following methods of evaluation: 

1. Evaluation using algorithmic theory 

2. Simulation with job data derived from 

— an actual workload 

— a workload model 

In general theoretical evaluation is not well suited for our scheduling algorithms 
as already discussed in Section 2.2. Occasionally, this method is used to deter- 
mine lower bounds for schedules. These lower bounds can provide an estimate for 
a potential improvement of the schedule by switching to a different algorithm. 
However, it is very difficult to find suitable lower bounds for complex objective 
functions. 

Alternatively, an algorithm can be fed with a stream of job submission data. 
The actual schedule and its cost are determined by simulation with the help of 
the complete set of job data. The procedure is repeated with a large number of 
input data sets. The reliability of this method depends on several factors: 

— Availability of correct job data 

— Compliance of the used job set with the job set on the target machine 
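A very small simulation of this kind is sketched below. It is not one of the algorithms evaluated in this paper: as an assumption for illustration it uses plain FCFS without backfilling, rigid jobs with exclusive node access, and a job record of the form `(submit_time, nodes, runtime)` in which the runtime is taken from the complete job data.

```python
import heapq


def simulate_fcfs(jobs, total_nodes):
    """Simulate strict FCFS for rigid jobs and return (start, end) per job.

    jobs: list of (submit_time, nodes, runtime), sorted by submit_time.
    A job never starts before a job that was submitted earlier.
    """
    running = []  # min-heap of (end_time, nodes) for jobs in execution
    free = total_nodes
    clock = 0
    schedule = []
    for submit, nodes, runtime in jobs:
        if nodes > total_nodes:
            raise ValueError("job requests more nodes than the machine has")
        clock = max(clock, submit)
        # Release the nodes of all jobs that have completed by now.
        while running and running[0][0] <= clock:
            free += heapq.heappop(running)[1]
        # Otherwise wait for running jobs until enough nodes are free.
        while free < nodes:
            end, released = heapq.heappop(running)
            clock = max(clock, end)
            free += released
        heapq.heappush(running, (clock + runtime, nodes))
        free -= nodes
        schedule.append((clock, clock + runtime))
    return schedule


# Three hypothetical jobs on a 4-node machine.
print(simulate_fcfs([(0, 2, 10), (1, 4, 5), (2, 1, 3)], 4))
# [(0, 10), (10, 15), (15, 18)]
```

In this example the third job waits behind the second one although a node is idle at time 2, which is the kind of effect an objective function would have to expose when algorithms are compared.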

Actual workload data can be used if they are recorded on a machine with a 
user group such that sufficient similarity exists with the target machine and its 
users. This is relatively easy if traces from the target machine and the target user 
community are available. Otherwise some traces must be obtained from other 
sources, see e.g. [1]. In this case it is necessary to check whether the workload 
trace is suitable. This may even require some modifications of the trace. 

Also note that the trace only contains job data of a specific schedule. In an- 
other schedule these job data may not be valid as is demonstrated in Examples 2 
and 3: 

Example 2. Assume a parallel processor that uses a single bus for communica- 
tion. Here, independent jobs compete for the communication resource. Therefore, 
the actual performance of job i depends on the jobs executed concurrently with 
job i. 



Example 3. Assume a machine and a scheduling system that support adaptive 
partitioning. In this case, the number of resources allocated to job i again de- 
pends on other jobs executed concurrently with job i. 



On the Design and Evaluation of Job Scheduling Algorithms 






Also, a comprehensive evaluation of an algorithm frequently requires a large 
amount of input data that available workload traces may not be able to provide. 

If accurate workload data are not available then artificial data must be gen- 
erated. To this end a workload model is used. Again conformity with future real 
job data is essential and must be verified. On the other hand, this approach is 
able to overcome some of the problems associated with trace data simulation, 
if the workload model is precise enough. For a more detailed discussion of this 
subject, see also [3]. 

Unfortunately, it cannot be expected that a single scheduling algorithm will 
produce a better schedule than any other method for all used input data sets. In 
addition the resource consumption of the various algorithms may be different. 
Therefore, the process of picking the best suited algorithm may again require 
some form of multi criteria optimization. 



2.4 Dependences 

The main dependence stream between the components of a scheduling system 
is easy to see: The scheduling policy produces rules which are used to derive 
an objective function. The application of this objective function to a schedule 
yields the schedule cost which allows performance measurements for the various 
algorithms. However, there are also additional dependences. For instance, some 
policy rules may not allow efficient scheduling algorithms, see Example 4. 

Example 4. Assume a machine that does not support time sharing. The schedul-
ing policy includes the rule:

Every weekday at 10am the entire machine must be available to a theoretical 
chemistry class for 1 hour. 

The Pareto-optimal schedules used for the determination of the objective 
function show an acceptable (by the owner) amount of idle resources before 
10am. However, as users are not able to provide accurate execution time esti- 
mates for their jobs no scheduling algorithm can generate good schedules. 

Such a situation is shown in Fig. 2 for Example 1. There, it is assumed that 
on-line algorithms cover a significantly smaller area of schedules than off-line 
methods with complete job knowledge. This may require a review of the conflict 
resolving strategy and thus affect the schedule cost. Unfortunately, this on-line 
area of schedules will typically be the result of a combination of several on-line 
algorithms. Therefore, the off-line methods in the approach of Section 2.2 cannot 
be simply replaced by a single or a few on-line algorithms. In addition, a suitable 
on-line algorithm may not be available at this time. 

More of these additional dependences are listed below: 

— Too many or too restrictive policy rules may prevent acceptable schedules 
at all. 

— There may not be sufficient rules to discriminate between good and bad sched- 
ules as some implicitly assumed rules are not explicitly stated. 




— While there may be a variety of different objective functions which all sup- 
port the policy rules, a specific objective function may not be suitable as a 
criterion for an on-line scheduling algorithm. 

— The workload model may not be correct if users adapt their submission 
pattern due to their knowledge of the policy rules. 

— The workload model must be modified as the number of users and/or the 
types and sizes of submitted jobs change over time. 

Due to these dependences a few design iterations may be required to deter- 
mine the best suitable scheduling algorithms and/or it may be appropriate to 
repeat the design process occasionally. 

2.5 Comparison 

In this section we briefly compare the evaluation of scheduling systems with 
the well known procedure used for computer architectures. Today, computer 
architectures are typically evaluated with the help of standard benchmarks, like 
SPEC95 or Linpack, see [8]. For instance, the SPEC95 benchmark suite contains 
a variety of programs and frequently, no architecture is the best for all those 
programs. Depending on his own applications the user must select the machine 
best suited for him. This leads to the question whether a similar approach is 
also applicable for scheduling systems. In other words, can we provide a few
benchmark workloads which are used to test various scheduling systems? 

We claim that this cannot be done at the moment and doubt whether this will
ever become possible. For computer architectures there is a standard objective 
function: the execution time of a certain job. As we discussed in the previous 
sections each scheduling system has its own objective function. Therefore, we 
cannot really compare two different scheduling systems. On the other hand, the 
comparison of different scheduling algorithms only makes sense if the same ob- 
jective function is used. Hence, the evaluation of scheduling algorithms must be 
based on benchmarks consisting of workloads and objective functions. However, 
it is not clear to us that there will ever be a small set of objective functions that 
will more or less cover all scheduling systems. 

3 Evaluation Example 

In this section we give an example for the design and evaluation of schedul- 
ing algorithms. As the focus is on scheduling algorithms we will assume simple 
scheduling policy rules and only briefly cover the determination of the objective 
function. Also we use a simple machine model. Although the used constraints 
have been taken from real installations it is not the purpose of this paper to 
discuss whether they are appropriate for an installation of a parallel machine. 

Example 5. Assume an Institution B that has just bought a large parallel com- 
puter with 288 identical nodes. The institution has established the following 
policy rules: 
1. The batch partition of the computer must be as large as possible, leaving a
few nodes for interactive jobs and for some services. 

2. The user must provide the exact number of nodes for each job (rigid job model)
and an upper limit for the execution time. If the execution of a job exceeds 
this upper limit, the job may be cancelled. 

3. The user is charged for each job. This cost is based on a combination of 
projected and actual resource consumption. 

4. Every user is allowed at most two batch jobs on the machine at any time. 

5. Between 7am and 8pm on weekdays the response time for all jobs should be 
as small as possible. 

6. Between 8pm and 7am on weekdays, and all day on weekends and holidays, the 
goal is to achieve a high system load. 

The machine supports variable partitioning [2] but does not allow time shar- 
ing. Further, it is required that all batch jobs have exclusive access to their 
partition. 

The administrator decides that 256 nodes can be used for the batch parti- 
tion. She further believes that the user community at the Cornell Theory Center 
(CTC) and at Institution B will be very similar. As the parallel machines at 
the CTC and at Institution B are of the same type she decides to use a CTC 
workload as a basis for the selection of the objective function and the determi- 
nation of a suitable scheduling algorithm. Due to the interdependence between 
user community and scheduling policy this decision also requires knowledge of 
the scheduling policy used at the CTC, see [9]. Only if there is no major 
disagreement between the scheduling policies at the CTC and at Institution B can 
the profiles of both user communities be assumed to remain similar. 



4 Determination of the Objective Function 

Next, the administrator must determine an objective function. To this end she 
ignores Rules 1 to 4 because they do not affect the schedule for a specific 
workload or are only relevant to the on-line situation (Rule 2). As Rules 5 and 6 
do not apply at the same time, she decides to consider each rule separately. 

Rule 4 indicates that all jobs should be treated equally, independent of their 
resource consumption. Therefore, the administrator uses the average response 
time as the objective function for the daytime on weekdays (Rule 5). The average 
response time is the sum of the differences between the completion time and the 
submission time of each job, divided by the number of jobs. 

For the remaining time (Rule 6) the sum of the idle times for all resources 
in a given time frame seems to be the best choice. 

The administrator intends to determine an appropriate scheduling algorithm 
for each objective function independently and then to address the combination 
of both algorithms. Note that multi-criteria optimization is therefore not 
necessary in our simple example. 

When starting to look for scheduling algorithms the administrator realizes 
that the sum of idle times is based on a time frame. Therefore, it does not support 
on-line scheduling. Using the makespan instead has the advantage that several 
theoretical results are available, see e.g. [5], but again the makespan is mainly an 
off-line criterion [3]. Hence, she decides to use the average weighted response 
time instead, where the weight is identical to the resource consumption of a 
job, that is, the product of the execution time and the number of required nodes, 
see [15]. It is calculated in the same fashion as the average response time, 
except that the difference between the completion time and the submission 
time of each job is multiplied by the weight of this job. In comparison, the 
job weight is always 1 for the average response time criterion. Note that for the 
average weighted response time the order of jobs does not matter if no resources 
are left idle [16]. 
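
The two objective functions can be written down compactly. The following Python 
sketch is only an illustration under stated assumptions: the Job record and its 
field names are hypothetical, all times are taken to be in seconds, and the 
weighted sum is divided by the number of jobs because the weighted criterion is 
described as being calculated in the same fashion as the unweighted one. 

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical record layout, chosen for this illustration only.
    submit: float      # submission time in seconds
    completion: float  # completion time in seconds
    runtime: float     # actual execution time in seconds
    nodes: int         # number of nodes required (rigid job model)


def average_response_time(jobs):
    """Sum of (completion - submission) over all jobs, divided by the job count."""
    return sum(j.completion - j.submit for j in jobs) / len(jobs)


def average_weighted_response_time(jobs):
    """Same as above, but each response time is multiplied by the job weight.

    The weight is the resource consumption: execution time times node count.
    """
    total = sum(j.runtime * j.nodes * (j.completion - j.submit) for j in jobs)
    return total / len(jobs)
```

If every weight is set to 1 instead of the resource consumption, the second 
function reduces to the first. 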

5 Description of the Algorithms 

After the objective function has been determined it is necessary to find a suit- 
able scheduling algorithm. Instead of producing an algorithm from scratch it is 
often more efficient to use algorithms from the literature and to modify them if 
necessary. In this first step it is frequently beneficial to consider a wide range 
of algorithms unless previous experiences strongly suggest the use of a specific 
type of algorithm. Further, there may be algorithms which have been designed 
for another objective function but can be adapted to the target function. 

In Example 5 the administrator picks several algorithms from the literature. 
These algorithms are discussed in the following subsections. 



5.1 FCFS 

First-Come-First-Serve (FCFS) is a well-known scheduling scheme that is used 
in some production environments. All jobs are ordered by their submission time. 
Then a greedy list scheduling method is used, that is, the next job in the list is 
started as soon as the necessary resources are available. This method has several 
advantages: 

1. It is fair as the completion time of each job is independent of any job sub- 
mitted later. 

2. No knowledge about the execution time is required. 

3. It is easy to implement and requires very little computational effort. 

However, FCFS may produce schedules with a relatively large percentage of idle 
nodes especially if many highly parallel jobs are submitted. Therefore, FCFS 
has been replaced by FCFS with some form of backfilling at many locations 
including the CTC. Nevertheless, the administrator does not want to ignore 
FCFS at this time as a theoretical study has recently shown that FCFS may 
produce acceptable results for certain workloads [16]. 
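
A minimal simulation of FCFS as a greedy list schedule may make the idle-node 
effect concrete. This is a sketch under assumptions that are not part of the 
original study: jobs are hypothetical (submit, nodes, runtime) tuples, the 
machine is a pool of identical nodes without topology constraints, and the 
actual runtime is used only to advance the simulated clock, not to make 
scheduling decisions. 

```python
import heapq


def fcfs_schedule(jobs, total_nodes):
    """Return the start time of each job under FCFS (greedy list scheduling).

    jobs: list of (submit_time, nodes, runtime) tuples (hypothetical layout).
    The result lists start times in the same order as the input.
    """
    order = sorted(range(len(jobs)), key=lambda i: jobs[i][0])
    running = []          # min-heap of (end_time, nodes) for active jobs
    free = total_nodes
    now = 0.0
    starts = [None] * len(jobs)
    for i in order:
        submit, nodes, runtime = jobs[i]
        if nodes > total_nodes:
            raise ValueError("job does not fit on the machine")
        now = max(now, submit)
        # Release the nodes of all jobs that have finished by now.
        while running and running[0][0] <= now:
            free += heapq.heappop(running)[1]
        # The head of the list blocks all later jobs until it fits.
        while free < nodes:
            end, released = heapq.heappop(running)
            now = max(now, end)
            free += released
        starts[i] = now
        free -= nodes
        heapq.heappush(running, (now + runtime, nodes))
    return starts
```

For example, on 4 nodes the jobs (0, 3, 10), (1, 2, 5) and (2, 1, 1) start at 
times 0, 10 and 10: the one-node job waits behind the two-node job although a 
node is idle from time 2 onwards. 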

5.2 Backfilling 

The backfilling algorithm has been introduced by Lifka [10]. It requires knowl- 
edge of the job execution times and can be applied to any greedy list schedule. 
If the next job in the list cannot be started due to a lack of available resources, 
then backfilling tries to find another job in the list which can use the idle re- 
sources but will not postpone the execution of the next job in the list. In other 
words, backfilling allows some jobs down the list to be started ahead of time. 
There are 2 variants of backfilling as described by Feitelson and Weil [4]: 

EASY backfill is the original method of Lifka. It has been implemented in several 
IBM SP2 installations. While EASY backfill will not postpone the projected 
execution of the next job in the list, it may increase the completion time of 
jobs further down the list, see [4]. 

Conservative backfill will not increase the projected completion time of a job 
submitted before the job used for backfilling. On the other hand conservative 
backfill requires more computational effort than EASY. 

However, note that the statements regarding the completion time of skipped jobs 
in the list are all based on the provided execution time for each job. Backfilling 
may still increase the completion time of some jobs compared to FCFS as in an 
on-line scenario another job may release some resources earlier than assumed. In 
this case it is possible that a backfilled job may prevent the start of the next job 
in the list. For instance, while some active job is expected to run for another 2 
hours, it may terminate within the next 5 minutes. Therefore, backfilling with 
a job having an expected execution time of 2 hours may delay the start of the 
next job in the list by up to 1 hour and 55 minutes. 

The administrator decides to use both types of backfilling as it is not obvious 
that one method is better than the other. 
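
The EASY condition, that a backfilled job must not postpone the projected start 
of the next job in the list, can be checked as sketched below. The formulation 
with a projected start time of the head job and a count of spare nodes is a 
common way to implement this test and is an assumption here, not a description 
of the code used in the study. All names and the (projected_end, nodes) layout 
are hypothetical. 

```python
def can_backfill(now, free, running, head_nodes, cand_nodes, cand_estimate):
    """Decide whether a candidate job may be backfilled under the EASY rule.

    running: list of (projected_end, nodes) for active jobs, based on the
    execution time limits provided by the users.
    """
    if cand_nodes > free:
        return False
    # Projected start of the head job: release running jobs in order of their
    # projected end until enough nodes have accumulated.
    avail = free
    head_start = now
    for end, nodes in sorted(running):
        if avail >= head_nodes:
            break
        avail += nodes
        head_start = end
    if avail < head_nodes:
        raise ValueError("head job does not fit on the machine")
    spare = avail - head_nodes  # nodes the head job will not need at its start
    finishes_in_time = now + cand_estimate <= head_start
    return finishes_in_time or cand_nodes <= spare
```

With times in minutes, can_backfill(0, 2, [(120, 2)], 4, 2, 120) is true. If 
the running job then terminates after 5 minutes instead of 120, the head job 
still has to wait for the backfilled job, which is the delay of up to 1 hour 
and 55 minutes described above. 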

5.3 List Scheduling (Garey and Graham) 

The classical list scheduling algorithm by Garey and Graham [6] always starts 
the next job for which enough resources are available. Ties can be broken in an 
arbitrary fashion. The algorithm guarantees good theoretical bounds in some 
on-line scenarios (unknown job execution time) [5]. It is easy to implement and 
requires little computational effort. As in the case of FCFS, no knowledge of the 
job execution time is required. Application of backfilling will be of no benefit for 
this method. 
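
The difference from FCFS is only in how the wait queue is scanned: FCFS stops 
at the first job that does not fit, whereas classical list scheduling starts 
every job that fits. A minimal sketch of one scheduling step follows; the 
function name and the (job_id, nodes) layout are hypothetical, and ties are 
broken by queue order as one arbitrary choice. 

```python
def start_fitting_jobs(queue, free_nodes):
    """One step of classical list scheduling.

    queue: list of (job_id, nodes) in list order.
    Returns (started job ids, remaining queue, remaining free nodes).
    """
    started, waiting = [], []
    for job_id, nodes in queue:
        if nodes <= free_nodes:
            # Any job that fits is started, regardless of jobs ahead of it.
            started.append(job_id)
            free_nodes -= nodes
        else:
            waiting.append((job_id, nodes))
    return started, waiting, free_nodes
```

Because no waiting job fits into the remaining free nodes after such a step, 
there is nothing left for backfilling to fill. 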

5.4 SMART 

The SMART algorithm has been introduced by Turek et al. [21]. The algorithm 
consists of 3 steps: 

1. All jobs are assigned to bins based on their execution time. The upper 
bounds of those bins form a geometric sequence based on a parameter $\gamma$. 
In other words, the bins can be described by intervals of the possible 
execution time: $]0, 1],\ ]1, \gamma^1],\ ]\gamma^1, \gamma^2],\ \ldots$ The 
parameter $\gamma$ can be chosen to optimize the schedule. 

2. All jobs in a bin are assigned to shelves (subschedules) such that all jobs in 
a shelf are started concurrently. To this end the jobs in a bin are ordered 
and then arranged in a shelf as long as sufficient resources are available. 

3. The shelves are ordered using Smith’s rule [19], that is, for each shelf the sum 
of the weights of all jobs in the shelf is divided by the maximal execution 
time of any job in the shelf. Finally, those shelves with the largest ratio are 
scheduled first. 

Schwiegelshohn et al. [14] have presented two variants of ordering the jobs in a 
bin and assigning them to shelves (Step 2): 

SMART-FFIA 

1. The jobs of a bin are sorted according to the product of execution time 
and the number of required nodes, also called area, such that the smallest 
area goes first. 

2. The next job in this list is assigned to the first shelf with sufficient idle 
resources, that is, all shelves of this bin are considered. 

3. If there is no such shelf, a new one is created and placed on top of the 
other shelves of this bin. 

This approach is called the First Fit Increasing Area variant. 

SMART-NFIW 

1. All jobs of a bin are ordered by an increasing ratio of the number of 
required nodes to the weight of the job. 

2. The next job in this list is added to the current shelf if sufficient resources 
are available on this shelf. 

3. Otherwise a new shelf is created, placed on top of the current shelf and 
then becomes the current shelf itself. 

This is the Next Fit Increasing Width to Weight variant. 
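
The three SMART steps with the FFIA variant can be sketched as follows. This is 
an illustration under assumptions: jobs are hypothetical (runtime, nodes) 
tuples, the weight is passed in as a function so that both the weighted and 
the unweighted case can be tried, and the bin bounds 1, gamma, gamma^2, ... 
follow the interval description in Step 1. 

```python
def bin_index(runtime, gamma=2.0):
    """Index k of the bin ]gamma^(k-1), gamma^k] containing runtime (bin 0 is ]0, 1])."""
    k, bound = 0, 1.0
    while runtime > bound:
        bound *= gamma
        k += 1
    return k


def ffia_shelves(jobs, total_nodes):
    """First Fit Increasing Area: pack the jobs of one bin into shelves."""
    shelves = []
    for job in sorted(jobs, key=lambda j: j[0] * j[1]):  # smallest area first
        for shelf in shelves:
            used = sum(nodes for _, nodes in shelf)
            if used + job[1] <= total_nodes:
                shelf.append(job)   # first shelf with enough idle nodes
                break
        else:
            shelves.append([job])   # no shelf fits: open a new one
    return shelves


def smith_order(shelves, weight):
    """Order shelves by (sum of job weights) / (longest runtime), largest first."""
    def ratio(shelf):
        return sum(weight(j) for j in shelf) / max(r for r, _ in shelf)
    return sorted(shelves, key=ratio, reverse=True)
```

Calling smith_order with weight=lambda j: 1 gives the unweighted order, and 
with weight=lambda j: j[0] * j[1] the order for the resource consumption weight. 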

The SMART algorithm has a constant worst case factor for weighted and un- 
weighted response time scheduling. However, it is an off-line algorithm and can- 
not be directly applied to the scheduling problem of Example 5. It requires a 
priori knowledge of the execution time for all jobs and assumes that all jobs 
are available for scheduling at time 0. Therefore, the administrator modifies the 
SMART algorithm as follows: 

1. She does not use the SMART algorithm to determine an actual schedule but 
to provide a job order for all jobs already submitted but not yet started. 
Whenever new jobs are submitted the SMART algorithm is started again. 
Based on this order a greedy list schedule is generated, see FCFS. 

2. Instead of the actual execution time of a job the value provided by the user 
at job submission is used. 

In order to reduce the number of recomputations for the SMART algorithm, the 
schedule is recalculated when the ratio of the already scheduled jobs in the 
wait queue to all the jobs in this queue exceeds a certain value. In the example 
a ratio of | is used. The parameter $\gamma$ is chosen to be 2. 

As the final schedule is a list schedule the administrator decides to apply 
backfilling here as well. 

5.5 PSRS 

The PSRS algorithm [13] generates preemptive schedules. It is based on the 
modified Smith ratio of a parallel job, that is, the ratio of job weight to the 
product of required resources and the execution time of the job. The basic steps 
of PSRS are described subsequently: 

1. All jobs are ordered by their modified Smith ratio (largest ratio goes first). 

2. A greedy list schedule is applied for all jobs requiring at most 50% of the 
machine nodes. If a job needs more than half of all nodes and has been 
waiting for some time, then all running jobs are preempted and the parallel 
job is executed. After the completion of the parallel job, the execution of the 
preempted jobs is resumed. 

Similar to SMART, PSRS is also an off-line algorithm and requires knowledge 
of the execution time of the jobs. In addition it needs support for time sharing. 
Therefore, it cannot be applied to our target machine without modification. 
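
The ordering in Step 1 and the classification of wide jobs in Step 2 can be 
sketched as below. The dictionary keys are hypothetical, and the preemption 
logic itself is deliberately left out. 

```python
def modified_smith_ratio(weight, nodes, runtime):
    """Job weight divided by the product of required nodes and execution time."""
    return weight / (nodes * runtime)


def psrs_order(jobs, total_nodes):
    """Return (job ids ordered by decreasing ratio, ids of wide jobs).

    jobs: list of dicts with the hypothetical keys id, weight, nodes, runtime.
    A job counts as wide if it needs more than half of all nodes.
    """
    ordered = sorted(
        jobs,
        key=lambda j: modified_smith_ratio(j["weight"], j["nodes"], j["runtime"]),
        reverse=True,
    )
    wide = [j["id"] for j in ordered if j["nodes"] > total_nodes / 2]
    return [j["id"] for j in ordered], wide
```

Note that if the weight equals the resource consumption (nodes times execution 
time), every job has ratio 1 and the sort leaves the submitted order unchanged. 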

The off-line problems can be addressed in the same fashion as for the SMART 
algorithm. Further, it is necessary to transfer the preemptive schedule into a 
non-preemptive one. To this end, it is beneficial that a job is not executed con- 
currently with any other job if it causes the preemption of other jobs. 

1. First, 2 geometric sequences of time instances in the preemptive schedule 
are defined, one for those jobs causing preemption (wide jobs) and one for 
all other jobs (small jobs). In both cases the factor 2 is used with different 
offsets. These sequences define bins. 

2. All jobs are assigned to those bins according to their completion time in the 
preemptive schedule. Within a bin the original Smith ratio order is main- 
tained. 

3. A complete order of jobs is generated by alternately picking bins from each 
sequence, starting with the small job sequence. 

As with SMART the modified PSRS algorithm guarantees a constant approxi- 
mation factor for the off-line case (with and without preemption). 

The administrator decides to apply backfilling to PSRS schedules as well. 

6 Workload 

As already mentioned in Section 3 the administrator wants to base her algorith- 
mic evaluation on workload data from the CTC. In addition she decides to use 
two artificial workloads: 

1. Artificial workload based on probability distributions 

2. Artificial workload based on randomization 

The number of jobs in each workload is given in Table 1. The reasons for this 
selection are discussed in the following subsections. 

6.1 Workload Trace 

In Section 3 the administrator has already verified that a CTC workload trace 
would be suitable in general. She obtains a workload trace from the CTC batch 
partition for the months July 1996 to May 1997. The trace contains the following 
data for each job: 

- Number of nodes allocated to the job 

- Upper limit for the execution time 

- Time of job submission 

- Time of job start 

- Time of job completion 

- Additional hardware requests of the job: amount of memory, type of node, 
access to mass storage, type of adapter. 

- Additional job data like job name, LoadLeveler class, job type, and 
completion status. 

Those additional job data are ignored as they are of no relevance to the 
simulation at this point. But the administrator must address two differences 
between the CTC machine and the parallel computer at her institution: 

1. The CTC computer has a batch partition of 430 nodes while the batch 
partition at Institution B contains only 256 nodes. 

2. The nodes of the CTC computer are not all identical. They differ in type 
and memory. This is not true for the machine at Institution B. 

A closer look at the CTC workload trace reveals that less than 0.2% of all 
jobs require more than 256 nodes. Therefore, the administrator modifies the 
trace by simply deleting all those highly parallel jobs. Further, she determines 
that most nodes of the CTC batch partition are identical (382). Therefore, she 
decides to ignore all additional hardware requests. 
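
The two trace modifications amount to a simple filter. The sketch below assumes 
a hypothetical in-memory representation of the trace as a list of dictionaries; 
the field names are illustrative and do not reflect the actual CTC log format. 

```python
def adapt_trace(trace, max_nodes=256):
    """Drop jobs wider than the target batch partition and strip ignored fields.

    Returns the adapted trace and the fraction of jobs that were removed.
    """
    # Only the fields needed for the simulation are kept; hardware requests
    # and other job data are ignored.
    keep_fields = ("nodes", "limit", "submit", "start", "completion")
    kept = [
        {field: job[field] for field in keep_fields}
        for job in trace
        if job["nodes"] <= max_nodes
    ]
    removed_fraction = (len(trace) - len(kept)) / len(trace) if trace else 0.0
    return kept, removed_fraction
```

Reporting the removed fraction makes it easy to confirm that only a small share 
of the jobs is affected by the filter. 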

Unfortunately, these modifications will affect the accuracy of the simulation. 
For instance, the simulation time frame of the whole modified CTC workload 
will most likely exceed the time span of the original trace as fewer resources are 
available. This will result in a larger job backlog during the simulation. 
Therefore, it is not possible to compare the original CTC schedule with the schedules 
generated by simulation. On the other hand, the administrator wants to sepa- 
rately test for two different objective functions, each of which will typically be 
valid for half a day. Hence, the present approach is only suited for a first eval- 
uation of different algorithms. Any parametric fine tuning must be done with a 
better workload. 

Besides using the CTC workload with the job submission data described 
above the administrator also wants to test her algorithms under the assumption 
that precise job execution times are available at job submission. This simulation 
allows her to determine the dependence of the various algorithms on the accuracy 
of the provided job execution times and the potential for improvement of the 
schedule. For this study the estimated execution times of the trace were simply 
replaced by the actual execution times. 



6.2 Workload with Probability Distribution 

In order to overcome some of the difficulties mentioned in Section 6.1 the admin- 
istrator decides to extract statistical data from the CTC workload trace. These 
data are then used to generate an artificial workload with the same distribution 
as the workload trace. 

An analysis of the CTC workload trace shows that a Weibull distribution 
best matches the submission times of the jobs in the trace. It is difficult to find 
a suitable distribution for the other parameters. Therefore, bins are created for 
every possible requested resource number (between 1 and 256) and for various 
ranges of requested time and of actual execution length. Then probability values 
are calculated for each bin from the CTC trace. Randomized values are generated 
and associated with the bins according to their probability. This generates a 
workload that is very similar to the CTC data set. In the first simulation, mainly 
the consistency between the results for the CTC and the artificial workload is 
checked. Once this consistency has been demonstrated, the artificial workload can 
be adapted to consider the various differences between the CTC and Institution B. 
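
The binning approach can be sketched with the Python standard library. 
Several points are assumptions made for this illustration: the trace is 
reduced to hypothetical (nodes, requested, actual) tuples, the time ranges are 
defined by a caller-supplied time_bin function, the Weibull distribution is 
applied to the gaps between successive submissions, and the scale and shape 
parameters are placeholders since the fitted values are not given. 

```python
import random
from collections import Counter


def bin_probabilities(trace, time_bin):
    """Relative frequency of each (nodes, requested range, actual range) bin."""
    counts = Counter(
        (nodes, time_bin(requested), time_bin(actual))
        for nodes, requested, actual in trace
    )
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}


def sample_bins(probabilities, n_jobs, seed=0):
    """Draw n_jobs bins at random according to their probability."""
    rng = random.Random(seed)
    bins = list(probabilities)
    weights = [probabilities[b] for b in bins]
    return rng.choices(bins, weights=weights, k=n_jobs)


def submission_times(n_jobs, scale, shape, seed=0):
    """Cumulative submission times with Weibull distributed gaps (assumption)."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n_jobs):
        t += rng.weibullvariate(scale, shape)
        times.append(t)
    return times
```

A possible choice for time_bin is lambda s: int(s // 3600), which groups times 
into one-hour ranges. 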



6.3 Randomized Workload 

Finally, totally randomized data are used as a third input data set. The adminis- 
trator is aware of the fact that this workload will not represent any real workload 
on her machine. But she wants to determine the performance of scheduling al- 
gorithms even in case of unusual job combinations. For the workload, jobs are 
generated with the parameters in Table 2 being equally distributed. 
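
A generator for such a workload could look as follows. The value ranges are 
those of Table 2, converted to seconds. The arrival process is an assumption: 
the table only constrains the submission rate, so the sketch simply draws the 
gap between two submissions uniformly from at most one hour (max_gap), which is 
a hypothetical parameter. 

```python
import random


def random_workload(n_jobs, seed=0, max_gap=3600.0):
    """Generate (submit, nodes, limit, actual) tuples with uniform parameters."""
    rng = random.Random(seed)
    t = 0.0
    jobs = []
    for _ in range(n_jobs):
        t += rng.uniform(0.0, max_gap)        # assumed arrival process
        nodes = rng.randint(1, 256)           # requested number of nodes
        limit = rng.uniform(300.0, 86400.0)   # upper limit: 5 min to 24 h
        actual = rng.uniform(1.0, limit)      # actual time: 1 s to upper limit
        jobs.append((t, nodes, limit, actual))
    return jobs
```

Fixing the seed keeps the workload reproducible, so that all algorithms are 
compared on identical input. 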



Workload                   Number of jobs 
CTC                        79,164 
Probability distribution   50,000 
Randomized                 50,000 

Table 1. Number of jobs in various workloads 

Submission of jobs                   > 1 job per hour 
Requested number of nodes            1 - 256 
Upper limit for the execution time   5 min - 24 h 
Actual execution time                1 s - upper limit 

Table 2. Parameters for randomized job generation 

Fig. 3. Average Response Time for the Unweighted CTC Workload 



7 Evaluation Results 

The administrator selects the simulation of FCFS with EASY backfilling to be 
a reference value as this algorithm is used by the CTC. First she compares the 
results for the CTC workload trace, see Fig. 3 and Table 3. For the unweighted 
case she comes to the following conclusions: 

- All algorithms are clearly better than FCFS even if some form of backfilling 
is used together with FCFS. 

- PSRS and SMART can be improved significantly with backfilling. 

- The classical list scheduling produces good results but is inferior to PSRS 
and SMART with backfilling. 

- Conservative backfilling outperforms EASY backfilling when applied to 
PSRS and SMART schedules. 

- There is little difference between PSRS and SMART schedules when 
backfilling is used. 

The administrator does not give much weight to the absolute numbers as the 
workload trace has been recorded on a machine with 430 nodes while the sim- 
ulations are done for a machine with 256 nodes. Although some highly parallel 
jobs have been removed from the trace a machine with 256 nodes will experience 
a larger backlog which results in a longer average response time. 
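
Since the absolute numbers carry little weight, the comparison rests on the 
pct columns of the result tables. The values in Table 3 are consistent with 
the relative deviation from the reference (FCFS with EASY backfilling, listed 
as 0%). The helper below is a hypothetical illustration of that arithmetic and 
not code from the study. 

```python
def pct_vs_reference(value, reference):
    """Relative deviation of value from reference in percent, one decimal place."""
    return round(100.0 * (value - reference) / reference, 1)
```

For the unweighted CTC workload the reference is 3.95E+05 s. Plain FCFS with 
4.91E+06 s then gives +1143.0%, and PSRS without backfilling with 1.59E+05 s 
gives -59.7%. 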

In the weighted case as shown in Fig. 4, the results are different: 

- The classical list scheduling algorithm clearly outperforms all other 
algorithms. 

- PSRS and SMART can be improved with either form of backfilling but are 
never better than FCFS with EASY. 

- EASY is superior to conservative backfilling. 

- PSRS is slightly better than either form of SMART. 

The artificial workload based on probability distributions basically supports 
the results derived with the CTC workload, see Fig. 5 and Table 4. This seems 
to indicate that the larger backlog in the CTC workload does not significantly 
affect the simulation results. However, it is strange that the absolute values 
for the average response time are even larger than in the CTC workload case 
although the number of jobs in the same time frame is significantly smaller. The only 
difference from the CTC workload is the fact that EASY is better than conservative 
backfilling if combined with PSRS or SMART in the unweighted case. 

The derived qualitative relationship between the various algorithms is also 
supported by the randomized workload, see Table 5. Therefore, the administrator 
need not worry if a workload will occasionally deviate from her model. 

Fig. 5. Average Response Time for the Unweighted Probabilistic Workload 



Next the administrator addresses the simulation using the CTC workload 
with exact job execution times, see Fig. 6 and Table 6. By comparing those 
results with the CTC workload simulations (Table 3) she wants to determine how 
much the accuracy of the job execution time estimation affects the schedules. 
This comparison yields the following results: 

- In the unweighted case the average response time of PSRS and SMART 
schedules can be improved by almost a factor of 2. 

- In the weighted case both forms of backfilling achieve better results than the 
classical list scheduling if applied to FCFS or PSRS schedules. 

- Surprisingly, SMART schedules with backfilling give worse results in the 
weighted case for the CTC workload using the exact job execution time 
than for the original submission data. 

Finally, the administrator considers the computation time to execute the 
various algorithms for the CTC workload (Table 7) and the artificial workload 
based on probability distributions (Table 8). In both cases similar results are 
obtained with a few observations being noteworthy: 

- It is surprising that the classical list scheduling algorithm requires a similar 
computation time for both workloads while the larger number of jobs in the 
CTC workload results in more computational effort in almost all other cases. 

- In the unweighted case SMART and PSRS together with EASY require 
approximately the same computation time, which is significantly less than 
that needed by FCFS and EASY. 

- In the weighted case PSRS and SMART need a significant amount of 
computation time. 

Fig. 6. Comparison of the Average Response Time for Exact vs. Estimated Job 
Execution Length 

In conclusion, the administrator decides to use the classical list scheduling 
algorithm for the weighted case. In the unweighted case the results are not that 
clear. She intends to use either SMART or PSRS together with some form of 
backfilling. However, she wants to execute more simulations to fine tune the pa- 
rameters of those algorithms before making the final decision. In addition she 
must evaluate the effect of combining the selected algorithms. This concludes 
the evaluation example. 

Note that there may be plenty of reasons to consider other algorithms or to 
modify the simulation model. It is only the purpose of this paper to describe a 
method for the systematic design and evaluation of scheduling systems. 



8 Conclusions 

In this paper we have presented a strategy to design scheduling systems for 
parallel processors. This strategy was illustrated in part with the help of an 
example that addressed the following items in particular: 

1. Determination of an objective function from a given simple set of policy rules 

2. Selection of several scheduling algorithms from the literature 

3. Modification of the selected algorithms where necessary 

4. Evaluation of the algorithms with the help of real and artificial workloads 

We want to point out that it is not the goal of this paper to show the superiority 
of any single scheduling algorithm. On the contrary, we believe that there is no 
algorithm that is suited for all scheduling systems. In our view the design of a 
good scheduling system will remain an important task for the administrators or 
owners of large parallel systems. This paper only tries to provide some guidelines. 

                   Listscheduler          Backfilling            EASY-Backfilling 
                   sec        pct         sec        pct         sec        pct 
Unweighted Case 
  FCFS             4.91E+06   +1143.0%    6.70E+05   +69.6%      3.95E+05   0% 
  PSRS             1.59E+05   -59.7%      1.02E+05   -74.2%      1.06E+05   -73.2% 
  SMART-FFIA       1.57E+05   -60.2%      1.00E+05   -74.7%      1.17E+05   -70.4% 
  SMART-NFIW       1.82E+05   -53.9%      1.02E+05   -74.2%      1.11E+05   -71.9% 
  Garey&Graham     1.46E+05   -63.0% 
Weighted Case 
  FCFS             4.99E+11   +249.0%     1.83E+11   +28.0%      1.43E+11   0% 
  PSRS             3.82E+11   +167.1%     1.70E+11   +18.9%      1.43E+11   0% 
  SMART-FFIA       3.57E+11   +149.6%     2.00E+11   +39.9%      1.51E+11   +5.6% 
  SMART-NFIW       3.91E+11   +173.4%     2.03E+11   +42.0%      1.49E+11   +4.2% 
  Garey&Graham     1.20E+11   -16.1% 

Table 3. Average Response Time for the CTC Workload 







                   Listscheduler          Backfilling            EASY-Backfilling 
                   sec        pct         sec        pct         sec        pct 
Unweighted Case 
  FCFS             6.17E+06   +499.0%     1.06E+06   +2.9%       1.03E+06   0% 
  PSRS             2.86E+05   -72.2%      1.71E+05   -83.4%      1.55E+05   -85.0% 
  SMART-FFIA       2.67E+05   -74.1%      1.74E+05   -83.1%      1.57E+05   -84.8% 
  SMART-NFIW       2.85E+05   -72.3%      1.65E+05   -84.0%      1.64E+05   -84.1% 
  Garey&Graham     2.78E+05   -73.0% 
Weighted Case 
  FCFS             6.17E+11   +108.4%     3.03E+11   +2.4%       2.96E+11   0% 
  PSRS             5.10E+11   +72.3%      3.05E+11   +3.0%       2.91E+11   -1.7% 
  SMART-FFIA       4.84E+11   +63.5%      3.33E+11   +12.5%      2.97E+11   +0.3% 
  SMART-NFIW       4.86E+11   +64.2%      3.31E+11   +11.8%      3.03E+11   +2.4% 
  Garey&Graham     2.72E+11   -8.1% 

Table 4. Average Response Time for the Probability Distributed Workload 

                   Listscheduler          Backfilling            EASY-Backfilling 
                   sec        pct         sec        pct         sec        pct 
Unweighted Case 
  FCFS             3.40E+08   +96.5%      1.72E+08   -0.6%       1.73E+08   0% 
  PSRS             1.66E+08   -4.0%       1.44E+08   -16.8%      1.32E+08   -23.7% 
  SMART-FFIA       1.57E+08   -9.2%       1.41E+08   -18.5%      1.37E+08   -20.8% 
  SMART-NFIW       1.61E+08   -6.9%       1.42E+08   -17.9%      1.39E+08   -19.7% 
  Garey&Graham     1.73E+08   0% 
Weighted Case 
  FCFS             9.40E+14   +41.6%      6.66E+14   +0.3%       6.64E+14   0% 
  PSRS             8.66E+14   +30.4%      6.61E+14   -0.5%       6.60E+14   -0.6% 
  SMART-FFIA       8.15E+14   +22.7%      7.54E+14   +13.6%      6.96E+14   +4.8% 
  SMART-NFIW       9.05E+14   +36.3%      7.96E+14   +19.9%      7.09E+14   +6.8% 
  Garey&Graham     6.68E+14   +0.6% 

Table 5. Average Response Time for the Randomized Workload 


                   Listscheduler          Backfilling            EASY-Backfilling 
                   sec        pct         sec        pct         sec        pct 
Unweighted Case 
  FCFS             4.91E+06   0%          4.05E+05   -39.6%      3.93E+05   -0.5% 
  PSRS             1.05E+05   -34.0%      6.35E+04   -37.7%      5.48E+04   -48.3% 
  SMART-FFIA       9.07E+04   -42.2%      5.60E+04   -45.1%      5.33E+04   -49.7% 
  SMART-NFIW       9.39E+04   -48.4%      5.66E+04   -44.5%      5.34E+04   -51.9% 
  Garey&Graham     1.46E+05   0.0% 
Weighted Case 
  FCFS             4.99E+11   0%          1.14E+11   -37.7%      9.82E+10   -31.3% 
  PSRS             3.91E+11   +2.4%       1.15E+11   -32.4%      9.91E+10   -30.7% 
  SMART-FFIA       3.03E+11   -15.1%      2.73E+11   +36.5%      2.58E+11   +70.9% 
  SMART-NFIW       3.33E+11   -14.8%      2.92E+11   +43.8%      2.68E+11   +79.9% 
  Garey&Graham     1.20E+11   0.0% 

Table 6. Average Response Time for the CTC Workload with Knowledge of 
the Exact Job Execution Time 






Jochen Krallmann et al. 







                   Listscheduler   EASY-Backfilling 
                   pct             pct 
Unweighted Case 
  FCFS             -81.6%          0% 
  PSRS             -76.7%          -33.7% 
  SMART            -75.6%          -32.7% 
  Garey&Graham     -58.4% 
Weighted Case 
  FCFS             -80.6%          0% 
  PSRS             +30.6%          -39.4% 
  SMART            -13.7%          -34.3% 
  Garey&Graham     -57.2% 

Table 7. Computation Time for the CTC Workload 







                   Listscheduler   EASY-Backfilling 
                   pct             pct 
Unweighted Case 
  FCFS             -92.1%          0% 
  PSRS             -88.5%          -79.6% 
  SMART            -87.1%          -80.1% 
  Garey&Graham     -72.3% 
Weighted Case 
  FCFS             -91.6%          0% 
  PSRS             -27.2%          -57.4% 
  SMART            -50.5%          -72.7% 
  Garey&Graham     -69.2% 

Table 8. Computation Time for the Probability Distributed Workload 

Abstract. We present a multivariate analysis technique called Co-plot that is 
especially suitable for samples with many variables and relatively few 
observations, as the data about workloads often is. Observations and variables 
are analyzed simultaneously. We find three stable clusters of highly correlated 
variables, but that the workloads themselves, on the other hand, are rather 
different from one another. Synthetic models for workload generation are also 
analyzed, and found to be reasonable; however, each model usually covers well 
one machine type. This leads us to conclude that a parameterized model of 
parallel workloads should be built, and we describe guidelines for such a model. 
Another feature that the models lack is self-similarity: We demonstrate that 
production logs exhibit this phenomenon in several attributes of the workload, 
and, in contrast, that none of the synthetic models do. 



1. Introduction 

A notion of the workload a system will face is necessary in order to evaluate 
schedulers, processor allocators, or make most other design decisions. Two kinds of 
workloads are typically used: A trace of a real production workload, or the output of a 
synthetic statistical model. 

Production logs have the advantage of being more realistic, and abstain as much as 
possible from making assumptions about the modeled system. At least three 
assumptions, however, are always there. First, we believe that we can draw 
conclusions from past workloads and learn about future ones. Second, we believe that 
we can infer from one installation - one scheduler, user set, and hardware 
configuration - about other ones. And third, we believe that the log contains no errors. 

In order to use a production log as a model, we must answer yes to all three 
questions. In reality, the third issue - correctness of the log - is almost always 
questioned by mysterious jobs that exceeded the system's limits, undocumented 
downtime, dedication of the system to certain users, and other 'minor' undocumented 
administrative changes which distort the users' true wishes [6,15]. The first two 
assumptions - similarity between configurations and along time - will be shown in 
this paper to be generally unjustifiable as well. 

Since using production traces suffers from apparent severe problems, researchers 
have turned to the other alternative, and have offered several synthetic models of 
parallel workloads [4,5,7,14,20]. The idea of basing models on measurements is not 
new [1,11], but there was very little practical work in the area [3] until 1996. Models 
have advantages over production logs by putting all the assumptions "on the table", 
and by allowing their user to easily vary the model’s parameters in order, for example, 
to generate a workload for a machine with a given number of processors. 

The problem with models is, of course, the need to correctly build them. This paper 
compares ten production workloads with the generated output of five synthetic 
models, and maps each model to the production environment to which it is closest. 
But we also go further, and try to identify desired properties of parallel workload 
models for the models yet to come: Which variables should be modeled, and which 
should not? What can we tell about the distribution of these variables? What can we 
tell about the correlation of these variables and the unmodeled ones? Should we create 
a single model, or a customizable one? And how should the model be altered to match 
a needed number of processors, the load, and so forth? 

In order to reach the answers, we use a new statistical method called Co-Plot, 
which is tailored for situations in which few observations but many variables about 
them are available. Section 2 presents Co-plot, and section 3 presents the data set used 
for the analysis. Sections 4, 5 and 6 analyze the production logs from several angles, 
which allows for a comparison with synthetic models available today in section 7, and 
a discussion about the implications in section 8. Section 9 deals with self-similarity, 
which is shown to differentiate between the production logs and the synthetic models. 



2. Co-Plot 

Classical multivariate analysis methods, such as cluster analysis and principal 
component analysis, analyze variables and observations separately. Co-Plot is a new 
technique which analyzes them simultaneously. This means, for example, that 
we will be able to see, in the same analysis, clusters of observations (workloads in our 
case), clusters of variables, the relations between clusters (correlation between 
variables, for example) and a characterization of observations (as being above average 
in certain variables and below in others). The technique has been used before mostly 
in the area of economics [18,23]. 

Co-plot is especially suitable for tasks in which there are few observations and 
relatively many variables - as opposed to regression based techniques, in which the 
number of observations must be an order of magnitude larger than the number of 
variables. This is crucial in our case, in which there are few workloads (ten production 
ones and five synthetic ones), and just as many variables. 

Co-plot's output is a visual display of its findings. It is based on two graphs that are 
superimposed on each other. The first graph maps the n observations into a two- 
dimensional space. This mapping, if it succeeds, conserves distance: observations that 
are close to each other in p dimensions are also close in two dimensions, and vice 
versa. The second graph consists of p arrows, representing the variables, and shows 
the direction of the gradient along each one. 

Given an input matrix $Y_{n \times p}$ of p variable values for each of n observations 
(see for example Table 1), the analysis consists of four stages. The first is to normalize 
the variables, which is needed in order to be able to relate them to each other, although 
each has different units and scale. This is done in the usual way. If $\bar{Y}_j$ is the 
j’th variable’s mean, and $D_j$ is its standard deviation, then $Y_{ij}$ is normalized 
into $Z_{ij}$ by: 

$$Z_{ij} = (Y_{ij} - \bar{Y}_j) / D_j \qquad (1)$$

In the second stage, we choose a measure of dissimilarity $S_{ik} \ge 0$ between each 
pair of observations (rows of $Z_{n \times p}$). A symmetric n x n matrix is produced 
from all the different pairs of observations. To measure $S_{ik}$, we use city-block 
distance - the sum of absolute deviations - as a measure of dissimilarity: 

$$S_{ik} = \sum_{j=1}^{p} |Z_{ij} - Z_{kj}| \qquad (2)$$
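
The first two stages can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which variant was used. The sample values are the runtime median and processors median of CTC, KTH and NASA from Table 1.

```python
from statistics import mean, pstdev

def normalize(Y):
    """Stage 1: column-wise z-scores, Z[i][j] = (Y[i][j] - mean_j) / D_j."""
    cols = list(zip(*Y))
    means = [mean(c) for c in cols]
    devs = [pstdev(c) for c in cols]  # assumption: population standard deviation
    # A constant column has D_j = 0 and would raise ZeroDivisionError here.
    return [[(y - m) / d for y, m, d in zip(row, means, devs)] for row in Y]

def city_block(Z):
    """Stage 2: symmetric n x n matrix S[i][k] = sum_j |Z[i][j] - Z[k][j]|."""
    return [[sum(abs(a - b) for a, b in zip(zi, zk)) for zk in Z] for zi in Z]

# Rows: CTC, KTH, NASA. Columns: runtime median, processors median (Table 1).
Y = [[960, 2], [848, 3], [19, 1]]
S = city_block(normalize(Y))
```

With these two variables only, CTC comes out less dissimilar to KTH than to NASA, which matches the raw values.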

In stage three, the matrix $S_{ik}$ is mapped by means of a multidimensional scaling 
(MDS) method. Such an algorithm maps the matrix $S_{ik}$ into a Euclidean space, of 
two dimensions in our case, such that 'close' observations (with a small dissimilarity 
between them) are close to each other in the map, while 'distant' ones are also distant 
in the map. Formally the requirement is as follows. Consider two observations, i and 
k, that are mapped a distance of $d_{ik}$ from each other. We want this to reflect the 
dissimilarity $S_{ik}$. But this is actually a relative measure, and the important thing 
is that: 

$$S_{ik} < S_{lm} \iff d_{ik} < d_{lm}$$

The MDS we use is Guttman's Smallest Space Analysis, or SSA [12]. SSA uses the 
coefficient of alienation $\theta$ as a measure of goodness-of-fit. The smaller it is, the 
better the output, and values below 0.15 are considered good. The intuition for 
$\theta$ comes directly from the above MDS requirement: success in fulfilling it implies 
that the product of the difference between two dissimilarity measures and the 
difference between the corresponding map distances is positive. In a normalized 
form, we define: 

$$\mu = \frac{\sum_{i,k,l,m} (S_{ik} - S_{lm})(d_{ik} - d_{lm})}{\sum_{i,k,l,m} |S_{ik} - S_{lm}| \, |d_{ik} - d_{lm}|} \qquad (3)$$

Thus $\mu$ can attain the maximal value of 1. This is then used to define $\theta$ as 
follows: 

$$\theta = \sqrt{1 - \mu^2} \qquad (4)$$



The details of the SSA algorithm are beyond the scope of this paper, and presented 
in [12]. It is a widely used method in social sciences, and several examples along with 
intuitive descriptions can be found in [21]. 
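
The goodness-of-fit measure itself, as opposed to the SSA algorithm, is easy to compute once a map is given. The sketch below evaluates $\mu$ and $\theta$ for a dissimilarity matrix and a matrix of map distances. It assumes that the sums run over all pairs of distinct observation pairs, which the text does not state explicitly, and the function name is hypothetical.

```python
from itertools import combinations
from math import sqrt

def alienation(S, d):
    """Return (mu, theta) for dissimilarities S and map distances d (n x n)."""
    pairs = list(combinations(range(len(S)), 2))  # observation pairs (i, k), i < k
    num = den = 0.0
    # Assumption: each pair of distinct observation pairs is counted once.
    for (i, k), (l, m) in combinations(pairs, 2):
        ds = S[i][k] - S[l][m]
        dd = d[i][k] - d[l][m]
        num += ds * dd
        den += abs(ds) * abs(dd)
    mu = num / den  # needs at least 3 observations and non-constant matrices
    return mu, sqrt(max(0.0, 1.0 - mu * mu))

S = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
perfect = [[0, 2, 6], [2, 0, 4], [6, 4, 0]]    # preserves the order of S
imperfect = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]  # swaps the order of two pairs
mu_ok, theta_ok = alienation(S, perfect)
mu_bad, theta_bad = alienation(S, imperfect)
```

A map that preserves the order of all dissimilarities gives $\mu = 1$ and $\theta = 0$, while any order violation makes a term in the numerator negative and pushes $\theta$ above zero.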

In the fourth stage of the Co-plot method, p arrows are drawn on the Euclidean 
space obtained in the previous stage. Each variable j is represented by an arrow j, 
emerging from the center of gravity of the n points. The direction of each arrow is 
chosen so that the correlation between the actual values of the variable j and their 
projections on the arrow is maximal (the arrows’ length is undefined). Therefore, 
observations with a high value in this variable should be in the part of the space the 
arrow points to, while observations with a low value in this variable will be at the 
other side of the map. 

Moreover, arrows associated with highly correlated variables will point in about 
the same direction, and vice versa. As a result, the cosines of angles between these 
arrows are approximately proportional to the correlations between their associated 
variables. 
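
One way to find such an arrow is sketched below. It relies on a standard fact that is not spelled out in the text: the direction that maximizes the correlation between a variable and the projections of the points is given by the least-squares regression of the variable on the two map coordinates. The function name and the sample points are hypothetical.

```python
from math import atan2, cos, sin, sqrt

def arrow_direction(points, values):
    """Return (angle in radians, correlation) of the best arrow for one variable."""
    n = len(points)
    cx = sum(p[0] for p in points) / n  # center of gravity of the map
    cy = sum(p[1] for p in points) / n
    vm = sum(values) / n
    xs = [p[0] - cx for p in points]
    ys = [p[1] - cy for p in points]
    vs = [v - vm for v in values]
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxv = sum(x * v for x, v in zip(xs, vs))
    syv = sum(y * v for y, v in zip(ys, vs))
    det = sxx * syy - sxy * sxy  # zero if all points lie on one line
    bx = (syy * sxv - sxy * syv) / det  # regression coefficients
    by = (sxx * syv - sxy * sxv) / det
    angle = atan2(by, bx)
    proj = [x * cos(angle) + y * sin(angle) for x, y in zip(xs, ys)]
    corr = sum(p * v for p, v in zip(proj, vs)) / sqrt(
        sum(p * p for p in proj) * sum(v * v for v in vs))
    return angle, corr

points = [(0, 0), (1, 0), (0, 1), (1, 1)]
angle_x, corr_x = arrow_direction(points, [0, 1, 0, 1])    # variable equals x
angle_xy, corr_xy = arrow_direction(points, [0, 1, 1, 2])  # variable equals x + y
```

The returned correlation is the per-variable goodness-of-fit value that stage 4 reports.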

The goodness-of-fit of the Co-plot technique is assessed by two types of measures, 
one for stage 3 and another for stage 4. In stage 3, a single measure - the coefficient 
of alienation in our case - is used to determine the quality of the two-dimensional 
map. In stage 4, p separate measures - one for each variable - are given. These are the 
magnitudes of the p maximal correlations, that measure the goodness of fit of the p 
regressions. These correlations help in deciding whether to eliminate or add variables: 
Variables that do not fit into the graphical display, namely, have low correlations, 
should in our opinion be removed. Therefore, there is no need to fit all the $2^p$ subsets 
of variables as in other methods that use a general coefficient of goodness-of-fit. The 
higher the variable's correlation, the better the variable's arrow represents the common 
direction and order of the projections of the n points along the axis it is on. 



3. The Data Set 

Over several years we have obtained both a set of production workloads and the 
source codes to generate a number of synthetic workloads. As part of the current 
study, all workloads were translated to the standard workload format, and are freely 
available to all researchers in the field from the parallel workloads archive at 
URL http://www.cs.huji.ac.il/labs/parallel/workload. We would also like to encourage 
others to produce logs and source codes whose output is in this format, in order to 
create a growing library of quickly accessible and reliable data, which would make it 
easier to validate research on many workloads at once. 

Traces of real production workloads were available from six machines: The NASA 
Ames iPSC/860 [9,24], the San Diego Supercomputing Center Paragon [24], the 
Cornell Theory Center SP2 [13], The Swedish Institute of Technology SP2, the Los 
Alamos National Lab CM-5, and the Lawrence Livermore National Lab Cray T3D. 
The Los Alamos and San Diego logs are displayed as three observations: The entire 
log, the interactive jobs only, and the batch jobs only. This gives a total of ten 
observations of production workloads. The characteristics of these workloads are 
given in Table 1. 

As Co-plot encourages it, the logs were tested for as many variables (attributes of 
the workloads) as possible. The following variables were measured for each 
workload: 







1. The number of processors in the system. 

2. Scheduler Flexibility. There were essentially three schedulers in this sample: 
the NQS batch queuing system, the EASY scheduler which uses backfilling, and 
gang schedulers. We ranked them in this ascending order, from 1 to 3. 

3. Processor Allocator Flexibility. There were again three ranks, in this order of 
increasing flexibility: Allocation of partitions with power-of-2 nodes, limited 
allocation (meshes, etc.), and unlimited allocation (where any arbitrary subset of 
the nodes can be used). 

4. Runtime Load, or the percent of available node seconds that were actually 
allocated to jobs. This is calculated as the sum of runtime multiplied by number 
of processors over all jobs, divided by the product of the number of processors in 
the machine multiplied by the log duration. 

5. CPU Load, which is the percent of actual CPU work out of the total available 
CPU time during the log’s lifetime. CPU times are the part of runtime in which 
the job actually processed; this however was missing from two workloads, and 
its definition is vague in some of the others, so focus was given to the runtime 
load. 

6. Normalized number of executables. As some logs are much longer than others, 
it is not surprising that more executables are represented in them. We therefore 
normalize by dividing the number of observed executables by the total number 
of jobs in the log. A lower number indicates more repeated requests to run the 
same executable. 

7. Normalized number of users. As above. 

8. Percent of successfully completed jobs. 

9. Median and 90% interval of the distribution of runtimes of jobs. 

10. Median and 90% interval of the distribution of degree of parallelism, i.e. the 
number of processors used by each job. 

11. Median and 90% interval of the distribution of normalized degrees of 
parallelism. This gives the number of processors that would be used out of a 
128-processor machine. It is calculated as the percent of available processors 
that jobs used, multiplied by 128. It enables the decoupling of conclusions about 
the effect of machine size and the effect of parallelism. 

12. Median and 90% interval of the distribution of total CPU work (over all 
processors of the job). 

13. Median and 90% interval of the distribution of inter-arrival times. 
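
Several of the variables above are simple ratios. The sketch below shows one possible way to compute the runtime load (item 4), a normalized count such as items 6 and 7, and the normalized degree of parallelism (item 11). The function names and the job representation are hypothetical and are not taken from the standard workload format.

```python
def runtime_load(jobs, machine_procs, log_duration):
    """jobs: iterable of (runtime_seconds, processors) pairs."""
    used = sum(runtime * procs for runtime, procs in jobs)
    return used / (machine_procs * log_duration)

def normalized_count(names):
    """Distinct executables (or users) divided by the total number of jobs."""
    return len(set(names)) / len(names)

def normalized_parallelism(procs, machine_procs, reference=128):
    """Processors a job would use on a 128-processor machine."""
    return reference * procs / machine_procs

load = runtime_load([(100, 64), (50, 128)], machine_procs=128, log_duration=200)
executables = normalized_count(["a.out", "a.out", "sim", "a.out"])
lanl = normalized_parallelism(64, 1024)  # LANL processors median in Table 1
kth = normalized_parallelism(3, 100)     # KTH processors median in Table 1
```

The last two calls reproduce the normalized medians listed for LANL (8.00) and KTH (3.84) in Table 1.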

As shown in [6], the average and standard deviation of these fields are extremely 
unstable due to the very long tail of the involved distributions. Removing the 0.1% 
'tail' jobs from a workload, for example, could change the average by 5% and the 
coefficient of variation (CV) by 40%. These findings follow similar ones in [16], and 
mean that the very big jobs must never be removed from workloads as outliers. On 
the other hand, most errors in traces are in this area of the distribution (what do you 
do with a job that lasted more than the system allows?). Therefore, it is preferable to 
use order statistics, such as the median and percentile intervals. In this case, the 90% 
interval - the difference between the 95th and 5th percentiles - was used; the 50% 
interval was also tested, and gave virtually the same results. 
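
The difference in stability can be illustrated with a small synthetic example. The data below is made up and is not taken from any of the logs. The percentile interpolation used by `statistics.quantiles` is an assumption, since the text does not say how the percentiles were computed.

```python
from statistics import mean, median, quantiles

def interval90(data):
    """Difference between the 95th and 5th percentiles."""
    cuts = quantiles(data, n=20, method="inclusive")  # cut points at 5%, 10%, ..., 95%
    return cuts[-1] - cuts[0]

# 1000 synthetic runtimes: 999 ordinary jobs and one extremely long job.
data = list(range(1, 1000)) + [1_000_000]
trimmed = sorted(data)[:-1]  # the same workload without its longest 0.1%

mean_change = mean(data) / mean(trimmed)            # the mean roughly triples
median_change = median(data) - median(trimmed)      # the median barely moves
interval_change = interval90(data) - interval90(trimmed)
```

Dropping a single job out of a thousand changes the mean by a large factor here, while the median and the 90% interval shift by less than one unit.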

Using the normalized degree of parallelism is preferable whenever possible, since it 
enables the decoupling of conclusions about the effect of machine size and the effect 
of parallelism. In our case we treat jobs as if they requested processors from a 
128-node machine. 



Variable                Sign  CTC      KTH      LANL     LANL     LANL     LLNL     NASA     SDSC     SDSC     SDSC
                                                         inter.   batch                               inter.   batch
Machine processors      MP    512      100      1024     1024     1024     256      128      416      416      416
Scheduler flexibility   SF    2        2        3        3        3        3        1        1        1        1
Allocation flexibility  AL    3        3        1        1        1        2        1        2        2        2
Runtime load            RL    0.56     0.69     0.66     0.02     0.65     0.62     N/A      0.7      0.01     0.69
CPU load                CL    0.47     0.69     0.42     0        0.42     N/A      0.47     0.68     0.01     0.67
Norm. executables       E     N/A      N/A      0.0008   0.0019   0.0012   0.0329   0.0352   N/A      N/A      N/A
Norm. users             U     0.0086   0.0075   0.0019   0.0049   0.0032   0.0072   0.0016   0.0012   0.0021   0.0029
% Completed jobs        C     0.79     0.72     0.91     0.99     0.85     N/A      N/A      0.99     1.00     0.97
Runtime median          Rm    960      848      68       57       376.00   36       19       45       12       1812
Runtime interval        Ri    57216    47875    9064     267      11136    9143     1168     28498    484      39290
Processors median       Pm    2        3        64       32       64.00    8        1        5        4        8
Processors interval     Pi    37       31       224      96       480.00   62       31       63       31       63
Norm. proc. median      Nm    0.76     3.84     8.00     4.00     8.00     4.00     1.00     1.54     1.23     2.46
Norm. proc. interval    Ni    14.10    39.68    28.00    12.00    60.00    31.00    31.00    19.38    9.54     19.38
CPU work median         Cm    2181     2880     256      128      2944     384      19       209      86       9472
CPU work interval       Ci    326057   355140   559104   2560     1582080  455582   19774    918544   3960     1754212
Inter-arrival median    Im    64       192      162      16       169      119      56       170      68       208
Inter-arrival interval  Ii    1472     3806     1968     276      2064     1660     443      4265     2076     5884



Table 1. Data of production workloads 

Since not all workload traces had all the required fields, missing values were 
approximated. These are all the assumptions that were made: 

1. If one of CPU load and runtime load was missing, the other of the two fields was 
used. This was done in the NASA and LLNL workloads. 

2. If the submit time of jobs was not known but the time the job started running (after 
possibly being in a queue) was, the inter-arrival time was based on this start time. 
This was necessary in NASA, LLNL, and the interactive workloads. 

3. In the NASA log, total CPU work wasn’t given and was approximated by the 
product of runtime and degree of parallelism. In the LLNL log, the opposite was 
done: The runtime was approximated by the total work divided by parallelism. 
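
Approximations 2 and 3 can be sketched as follows. The dictionary keys and function names are hypothetical; they only illustrate the arithmetic described in the list above.

```python
def fill_missing(job):
    """job: dict with 'runtime', 'procs' and 'cpu_work'; None marks a missing field."""
    job = dict(job)  # leave the caller's record untouched
    if job["cpu_work"] is None and job["runtime"] is not None:
        # NASA case: total work approximated by runtime times parallelism.
        job["cpu_work"] = job["runtime"] * job["procs"]
    elif job["runtime"] is None and job["cpu_work"] is not None:
        # LLNL case: runtime approximated by total work divided by parallelism.
        job["runtime"] = job["cpu_work"] / job["procs"]
    return job

def inter_arrival_times(times):
    """Differences between consecutive submit times (or start times, if submit is unknown)."""
    ordered = sorted(times)
    return [b - a for a, b in zip(ordered, ordered[1:])]

nasa_like = fill_missing({"runtime": 100, "procs": 8, "cpu_work": None})
llnl_like = fill_missing({"runtime": None, "procs": 4, "cpu_work": 200})
gaps = inter_arrival_times([30, 10, 25])
```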

The synthetic models usually only offer values for the inter-arrival times, runtimes 
and degree of parallelism. They were only compared to the production workloads in 
these fields, of course, and the remaining variables were discarded. 



4. Production Workloads: Variables 

Running Co-plot on all variables resulted in some of them having low correlations, 
and they were removed. These variables were the number of processors in the 
machine, the scheduler flexibility, the normalized number of users, the normalized 
number of executables, and the percent of completed jobs. This basically means that 
these attributes of a workload neither support nor refute the information derived from 
other variables. They are either irrelevant or belong to a different explanation 
universe. 

Two other variables, the CPU load and the processor allocation flexibility, were 
also removed from the final map (Figure 1), but will still be analyzed. The 
correlations of these variables were slightly lower than those of the others, therefore 
removing them improved the output (in the goodness of fit sense), but we can still 
draw conclusions about them from the direction their arrows would have had if they 
had not been removed. To represent the degree of parallelism, the normalized variant 
was used, although the unnormalized one gives almost exactly the same result. As 
discussed in the previous section, in such a case the normalized variant is preferable. 
The map in Figure 1 has a coefficient of alienation of 0.07 and an average of variable 
correlations of 0.88 with a minimum of 0.83. These are generally considered excellent 
goodness of fit values. 



Fig. 1. Co-plot output of all production workloads 
See Table 1 for variable signs 

First, it is clear that there are four clusters of highly correlated variables. 
Clockwise: 

1. The median and interval of the normalized degree of parallelism. 

2. The median of inter-arrival times, interval of total CPU work, and runtime load. 

3. The median of total CPU work and the interval of inter-arrival times. The CPU 
load (uncharted here) also belongs to this cluster. 

4. The median and interval of job runtimes. The processor allocation flexibility 
(uncharted here) also belongs to this cluster. 

It should be noted, however, that in some of the other runs (with more variables 
included, or some workloads excluded), the third cluster disappears: The CPU work 
median (Cm) joins the fourth cluster, and the inter-arrival times interval (Ii) joins the 
second. 

The map tells a lot about the types of distributions that should be used for 
modeling workloads. Both runtimes and the degree of parallelism exhibit a high 
positive correlation between the median and interval of the distribution. This means 
that systems that allow higher runtimes and parallelism also exhibit a more varied 
stream of requests. This probably occurs because common administrative tools, such 
as limiting the maximal runtime, affect both the median and interval of the observed 
runtime in the same way. We suspect these phenomena to be linked to administrative 
constraints in general, but this could not be quantitatively checked. 

Inter-arrival times and total CPU work, however, require a different treatment. 
Although their median and interval are positively correlated, the correlation is far 
from being full. This result repeats for other analyses for these two variables. Since 
none of the synthetic models published so far model runtime and actual CPU time 
separately, the focus has been put on runtimes. 

The fact that the processor allocation flexibility measure and the median and 
interval of runtimes fall in the same cluster hints that systems which are more flexible 
in their allocation attract, on average, longer jobs. Note that the scheduler, in contrast, 
was not found to have a significant effect on the other variables. This is a first step 
towards much-needed research into the changes in users’ requests due to their 
system's constraints - users learn the system over time, and change their behavior 
accordingly. When looking at production workloads, we only see the distorted 
picture, not the "true workload" by which we wish to design future systems [15]. 

Apart from defining variable clusters, we can also infer about the correlation 
between clusters. The first cluster, for example, is positively correlated with the 
second cluster, has a small negative correlation with the third, and a strong negative 
one with the fourth. The second cluster is positively correlated with the third cluster, 
but not correlated with the fourth one. The third and fourth clusters are positively 
correlated. 

One should be careful not to misinterpret these findings - note that they relate to 
whole workloads, not jobs. For example, the negative correlation between runtime 
(fourth cluster) and degree of parallelism (first cluster) means that systems with high 
average parallelism exhibit lower average runtimes, not that jobs that use many 
processors are shorter (in fact, the opposite of this was demonstrated in [6,10]). The 
reason for this may, for example, be that systems with more processors tend to 
enforce tighter runtime limits on jobs. This is merely a hypothesis, triggered by the 
observation that systems with fewer processors often try to compensate for this by 
offering more flexible policies. 

Another significant finding from the correlations between variable clusters is that 
the processor allocation flexibility of a system is positively correlated to the median 
of the total CPU work done in it, and is uncorrelated to its runtime load. Or, in short, 
systems whose allocation schemes are more flexible allow bigger jobs to run without 
affecting the average load. In contrast, neither the scheduler nor the number of 
processors in the system seems to have such an effect. 

There is more to be learned from the correlations between the second, third 
and fourth clusters, but since, as mentioned before, the third cluster sometimes melts 
into the other two, any such conclusions are dangerous. Only stable findings are 
reported. 



5. Production Workloads: Observations 

Two workloads in Figure 1 seem to be outliers, which 'stretch' the map in order to 
accommodate them - the batch jobs of LANL and SDSC, marked 'LANLb' and 
'SDSCb' respectively. The LANL batch jobs are far above average in the normalized 
number of used processors, and also exhibit high inter-arrival times combined with 
low runtimes and runtime intervals. The high degree of parallelism is probably a 
result of the fact that the system had static partitions, all powers of two, of which the 
smallest one has 32 processors. The SDSC batch workload has very long runtime and 
total CPU work averages, and an extreme interval of inter-arrival times as well. The 
SDSC is also relatively inflexible in its processor allocation. 

In order to understand the relations between the other workloads, and get another 
picture of the variables arena, another analysis is presented here that includes the 
same workloads as those used in Figure 1, but without the two batch workloads. The 
variables are also the same, except for the degree of parallelism, which is now not 
normalized - the normalized variables had too low correlations. The Figure 2 map has 
a coefficient of alienation of 0.01 and an average of variable correlations of 0.88. 

Comparing the variable clusters here to those of Figure 1 shows that the third 
cluster of Figure 1 has indeed broken up; the interval of inter-arrival times variable 
joined the second cluster, and the median of CPU work variable joined the fourth. All 
our conclusions remain the same. 

At the lower left part of the map are the two interactive workloads, again of LANL 
and SDSC respectively. Along with the NASA Ames log*, these three workloads 
seem to form the only natural observations cluster in this map: All other observations 
are evenly spread out across the map. This means that apart from a grouping of the 
interactive workloads - based on two observations only - the workloads exhibited by 
different systems are very different from one another. 

Since Co-plot analyzes observations and variables together, it is not only possible 
to see clusters of observations, but also to identify their nature. For example, the 
interactive jobs are characterized by being far below average on all variables: They 
have a shorter average inter-arrival time, and also shorter runtimes and lower degrees 
of parallelism. Such facts are deduced in Co-plot from the arrows: The projection of a 
point (workload, in our case) on a variable’s arrow should be proportional to its 
distance from the variable’s average, where above average is in the direction of the 
arrow and below average is in the opposite direction. 

* A caveat is that 57% of the jobs in the NASA log were small jobs used to periodically check 
the system availability, which obviously causes a strong bias towards low variable values. 



Fig. 2. All production workloads except the batch workloads 



In the same manner, the CTC workload (in the lower right side of the map) has 
very long runtimes but little parallelism, while the LANL workload (upper left) has a 
very high degree of parallelism, but below average runtimes. The LLNL workload 
seems to be the average - it is very close to the center of gravity in almost all 
variables. 

Note that although we can see that the workloads are ‘far’ from each other, this 
notion of distance is always relative to the other observations in this analysis. This 
happens because all variables must be normalized - otherwise we can’t compare 
relations between them - and means that we should beware of attaching ‘real’ distance 
to Co-plot’s output. 



6. Production Workloads Over Time 

Logs of production workloads from previous years are typically used as a model of 
the pattern of requests for next year. But since users learn the system as time passes 
and adjust their behavior accordingly, administrators fine-tune the system 
continuously, and the dominant projects on machines change every few months, it is 
unclear whether this is justified. 

Co-plot allows us to test just that, by mapping several consecutive periods of 
logged work on the same machine, and mapping them together with the other 
workloads. If past workloads were indeed good indications of the near future, we 
would expect consecutive workloads on the same machine to be mapped close to each 
other. Only two workloads in our sample were long enough to test this - the LANL 
and the SDSC logs were each divided into four periods of six months each. The data 
for these partial logs is given in Table 2. 





                        LANL                                            SDSC
Variable                10/94-3/95  4/95-9/95   10/95-3/96  4/96-9/96   1/95-6/95   7/95-12/95  1/96-6/96   7/96-12/96
Runtime Load            0.76        0.83        0.24        0.73        0.66        0.67        0.76        0.65
CPU Load                0.43        0.52        0.16        0.48        0.65        0.66        0.72        0.63
Executables per Job     0.0016      0.0014      0.0034      0.0016      N/A         N/A         N/A         N/A
Users per Job           0.0038      0.0038      0.0076      0.0042      0.0021      0.0019      0.0023      0.0023
% of Completed Jobs     0.93        0.93        0.82        0.90        0.99        0.99        0.98        0.97
Runtime Median          62          65          643         79          31          21          73          527
Runtime Interval        7003        7383        11039       11085       29067       20270       30955       25656
Processors Median       64          32          64          128         4           4           4           8
Processors Interval     224         224         480         480         63          63          63          63
Norm. Procs. Median     8           4           8           16          1.23        1.23        1.23        2.46
Norm. Procs. Interval   28          28          60          60          19.38       19.38       19.38       19.38
CPU Work Median         128         256         7648        384         169         119         295         1645
CPU Work Interval       300320      394112      1976832     1417216     504254      612183      1235174     1141531
Inter-Arrival Median    159         167         239         89          180         39          92          206
Inter-Arrival Interval  1948        1765        2448        1834        2422        5836        4516        5040


Table 2. Data of production workloads divided into six-month periods 



Figure 3 includes the same workloads as Figure 1, with the addition of the eight 
new workloads. The four parts of the LANL workload are marked L1 through L4, and 
the four parts of the SDSC workload are marked S1 through S4. The full, interactive 
and batch workloads of both sites were also kept. Two variables were removed 
because of low correlation: The runtime load, and the interval of the inter-arrival 
times. This does not mean that they shouldn't be used in general - it may very well be 
the case that they do not fit well only with the LANL or SDSC logs, which are 
together 14 out of the 18 observations in Figure 3. 

It is clear that the SDSC jobs are rather clustered, apart possibly from the last 
workload S4, which has slightly higher runtimes, degrees of parallelism, and 
inter-arrival times. The original full SDSC workload is some kind of average of its four 
parts, as was expected. On the other hand, the LANL workloads have a quiet first year 
(workloads L1 and L2, close to the original full LANL workload), but the second year 
is wildly different with L3 and L4, which are definite outliers. 



Fig. 3. Production workloads change over time 

A clarification with Curt Canada of LANL indeed revealed that at the end of 1995 
there was a significant change of usage of the CM-5. It approached the end of its life 
for grand challenge jobs, and only a couple of groups remained on the machine for 
special projects that were trying to finish. Fewer jobs of fewer users, mostly very long 
ones, were run in 1996. 

Co-Plot could be used in this manner to test any new log, by dividing it into several 
parts and mapping it with all the other workloads. This should tell whether the log is 
homogeneous, and whether it contains time intervals in which work on the logged 
machine had unusual patterns. 
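
The preparation step for such a test can be sketched as follows: split the log into consecutive periods and compute the per-period variables that would then be fed to Co-plot as separate observations. The field names are hypothetical, and fixed-length periods measured in seconds are an assumption made for brevity; the analysis above used calendar periods of six months.

```python
from statistics import median

def split_by_period(jobs, period_seconds):
    """Group jobs (dicts with a 'submit' time in seconds) into consecutive periods."""
    start = min(job["submit"] for job in jobs)
    parts = {}
    for job in jobs:
        index = int((job["submit"] - start) // period_seconds)
        parts.setdefault(index, []).append(job)
    return [parts[i] for i in sorted(parts)]  # periods without jobs are skipped

def runtime_median(part):
    return median(job["runtime"] for job in part)

log = [
    {"submit": 0, "runtime": 10},
    {"submit": 50, "runtime": 30},
    {"submit": 120, "runtime": 500},
]
parts = split_by_period(log, period_seconds=100)
medians = [runtime_median(p) for p in parts]
```

Each element of `parts` would become one observation (such as L1 through L4), described by the same medians and intervals as the full logs.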



7. Synthetic Workloads 

Having taken a serious look at the raw data from production logs, we now turn to 
inspect what research has done so far, namely the synthetic models that are currently 
available. To make a long story short, the models are quite good - none is an 
outrageous outlier - but each one is more representative of one or two production logs 
than of the others. Usually every researcher based his or her model on one log, and the 
model reflects this. 

The five synthetic models available are the following. The first model was 
proposed by Feitelson in '96 [7]. This model is based on observations from several 
workload logs. Its main features are the hand-tailored distribution of job sizes (i.e. the 
number of processors used by each job), which emphasizes small jobs and powers of 
two, the correlation between job size and running time, and the repetition of job 
executions. In principle such repetitions should reflect feedback from the scheduler, 
as jobs are assumed to be re-submitted only after the previous execution terminates. 
Here we deal with a pure model, so we assume they run immediately and are 
resubmitted after their running time. The second model is a modification from '97 [8]. 

The model by Downey is based mainly on an analysis of the SDSC log [4,5]. It 
uses a novel log-uniform distribution to model service times (that is, the total 
computation time across all nodes) and average parallelism. This is supposed to be 
used to derive the actual runtime based on the number of processors allocated by the 
scheduler. Again, as we are dealing with a pure model here, we instead use the 
average parallelism as the number of processors, and divide the service time by this 
number to derive the running time. 
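
The "pure model" simplification described above can be sketched as follows. A log-uniform variable is one whose logarithm is uniformly distributed. The bounds used here are hypothetical placeholders and are not Downey's fitted parameters, and rounding the average parallelism to a whole number of processors is an assumption of this sketch.

```python
import math
import random

def log_uniform(rng, lo, hi):
    """Sample a value whose logarithm is uniform between log(lo) and log(hi)."""
    return math.exp(rng.uniform(math.log(lo), math.log(hi)))

def pure_model_job(rng, max_service=1e5, max_parallelism=64):
    """One job: service time and parallelism are log-uniform (hypothetical bounds)."""
    service = log_uniform(rng, 1.0, max_service)  # total work over all nodes
    procs = max(1, round(log_uniform(rng, 1.0, max_parallelism)))
    # Pure model: use the average parallelism as the allocation, and divide
    # the service time by it to obtain the running time.
    return {"procs": procs, "runtime": service / procs, "cpu_work": service}

rng = random.Random(1)
jobs = [pure_model_job(rng) for _ in range(1000)]
```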

Jann's model is based on a careful analysis of the CTC workload [14]. Both the 
running time and inter-arrival times are modeled using hyper-Erlang distributions of 
common order, where the parameters for each range of number of processors are 
derived by matching the first 3 moments of the distribution. 
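
Sampling from a hyper-Erlang distribution of common order is straightforward: choose one branch at random, then sum a fixed number of exponential variables with that branch's rate. The sketch below shows this for a single range of processors. The branch probabilities and rates are hypothetical and are not the values fitted in [14], and the moment-matching step is not shown.

```python
import random

def hyper_erlang(rng, branches, order):
    """branches: list of (probability, rate); every branch is an Erlang of the same order."""
    probs = [p for p, _ in branches]
    rates = [r for _, r in branches]
    rate = rng.choices(rates, weights=probs)[0]  # pick a branch
    return sum(rng.expovariate(rate) for _ in range(order))

rng = random.Random(0)
branches = [(0.7, 1.0), (0.3, 0.1)]  # hypothetical parameters
order = 2
samples = [hyper_erlang(rng, branches, order) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
expected_mean = sum(p * order / r for p, r in branches)  # first moment of the mixture
```

The sample mean should be close to the analytic first moment, which is the first of the three quantities that the fitting procedure matches.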

The last model, by Lublin [20], is based on a statistical analysis of 4 logs. It 
includes a model of the number of processors used which emphasizes powers of two, 
a model of running times that correlates with the number of processors used by each 
job, and a model of inter-arrival times. While superficially similar to the Feitelson 
models, Lublin based the choice of distributions and their parameters on better 
statistical procedures in order to achieve a better representation of the original data. 

Figure 4 is the Co-Plot output of all the production workloads and the five 
synthetic models that were tested. Since all models only model the inter-arrival times, 
runtimes, and degrees of parallelism, the median and interval of each of these, 
along with those of the implied used CPU times (runtimes multiplied by the degree of 
parallelism), were the only eight variables that could be used. All eight showed high 
correlations to the map; the average correlation is 0.89 and the coefficient of 
alienation is 0.06. 

First, Uri Lublin's model places itself as the ultimate average. This result recurred 
in analyses with different variables and observations. So this model represents "the 
workload on the street", and when used to compare schedulers, for example, we can 
be sure that results will not be distorted by one out-of-line feature of the workload (of 
the variables analyzed here). However, most of the production workloads do have 
such out-of-line features, and are far from the center of gravity. Only the LLNL 
workload is close enough for the model to be accepted as a match. 

Downey’s model, as well as the two versions of Feitelson’s model, match the 
interactive workloads and the NASA one well. This is probably because the 
NASA log was the first to be published and seriously analyzed, and had a major effect 
on the creation of the earlier (e.g. 1997) models. To try to differentiate the 
three models, the batch workloads, which were also the outliers removed in Figure 2, 
were removed and the analysis was re-run. The result was essentially the same, 
with a "zoom in" on the lower left part of Figure 4. Feitelson’s 1997 model remained 
the closest to the interactive and NASA workloads, while his 1996 model was closer 
to the center of gravity and Downey’s model was further out from it. 




Jann’s model was designed specifically to match the CTC workload. It is indeed the 
closest to it, but is also close to the KTH workload. The CTC and KTH offer very 
similar environments: Both are IBM SP2 machines, with slightly different versions of 
the RS/6000 processor, using LoadLeveler with EASY for scheduling, and offering a 
totally flexible processor allocation scheme. With a limited warranty - since only two 
observations support this - it seems that Jann's model is appropriate for this kind of 
environment. 

The LANL and SDSC workloads have no model close to them, and the batch 
workloads of these two systems are still lonely outliers as well. This means that no 
workload model as it stands today models well the heavier batch jobs that these large 
machines see. The focus seems to be the interactive jobs, which, according to Table 1, 
provide only a fraction of the total load a parallel system has to handle. 

A quick look at the variable arrows of Figure 4 is also worthwhile. They are almost the 
same as those of Figure 1, which is good news: The synthetic workload models do not 
distort the 'real world' picture by assuming wildly incorrect distributions or correlations. 



8. Implications for Modeling 

From what we have seen so far, three major conclusions concerning the correct 
modeling of parallel workloads can be stated: 

1. Model the runtime and degree of parallelism by a distribution that has an almost 
full correlation between the median and interval. A high positive correlation of 
this kind also exists for the inter-arrival time. 

2. A single model cannot truly represent all systems. It is better to parametrize by 
three variables: The medians of total CPU work, degree of parallelism, and inter- 
arrival time. 

3. In order to alter the average load of a modeled workload, do not use any of the 
common techniques of multiplying the inter-arrival time, runtimes, or parallelism 
by a factor, or changing the lambda of an exponential distribution. 

The first statement is supported by the map of Figure 1, which says much about the 
types of distributions that should be used for modeling workloads. Both runtimes and 
the degree of parallelism are characterized by the fact that the median and interval of 
the distribution are highly correlated. An analogous feature is found in the exponential 
distribution, in which the mean and standard deviation are equal. Although the 
exponential distribution lacks the very long tail that workload modeling requires, variants 
such as two- and three-stage hyper-exponential distributions have been used in several 
recent models [7,20], a choice that our results appear to justify. 

Using an exponential distribution for the inter-arrival time is also popular. However, 
although the correlation between its median and interval is positive, it is not full, 
so it seems that a more sophisticated model is required. 

The second statement is clear from Figure 2. Apart from the NASA workload 
being similar to the interactive ones, all the workloads are scattered across the space 
that they define. While Lublin’s model represents the average, it does not closely 
represent any single model. But there is good news as well: Beyond telling us that a 
generic model must be a parameterized one, the Co-plot method also helps us find 
the variables upon which we should parameterize. We should take one representative 
from each cluster of variables, such that the representatives conserve the previously 
known map, and that their correlation is highest. 
In our case, the best results (with a coefficient of alienation of 0.02 and an average 
correlation of 0.94) were achieved by using the processor allocation flexibility and the 
medians of the (un-normalized) degree of parallelism and the inter-arrival time 
(Figure 4). The processor allocation flexibility could be replaced by the median of 
total CPU work, to give a slightly lower but still excellent goodness of fit. 

So, a general model of parallel workloads will accept these three parameters as 
input. It would use the highly positive correlations with other variables to assume 
their distributions. But what should we give it? The processor allocation flexibility is 
usually known, since it is one of the published characteristics of the modeled 
computer. Therefore it seems like the best estimator for the level of total CPU work in 
the modeled system. As for the two other medians, a way is needed to determine 
whether each would be above, below or around average. While we can offer little 
guidance on how that could be done, apart from using past knowledge as section 6 
suggests, we can tell what the averages are - the LLNL log seems to represent them, and 
Lublin's model looks like an even more accurate choice (Figure 4). 

The third statement we make deals with the preferable way to alter a workload's 
load. There are basically three ways to raise the load: Lowering the inter-arrival time, 
raising the runtimes, and raising the degree of parallelism. The most common [19, 22] 
technique is to expand or condense the distribution of one of these three fields by a 
constant factor. Note that by doing so the median and the interval (any interval) are 
also multiplied by the same factor. 
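
The effect of these constant-factor techniques on the load can be illustrated with a small sketch. The definition of load used here (total CPU work divided by machine capacity over the span between the first and last arrival) and the tuple layout of `jobs` are simplifying assumptions made for the example, not definitions taken from the analysis above.

```python
def offered_load(jobs, machine_size):
    """jobs: list of (arrival_time, runtime, processors), sorted by arrival time."""
    span = jobs[-1][0] - jobs[0][0]
    work = sum(runtime * procs for _, runtime, procs in jobs)
    return work / (machine_size * span)


def scale_inter_arrivals(jobs, factor):
    """Multiply every inter-arrival time by `factor`, keeping the first arrival fixed."""
    first = jobs[0][0]
    return [(first + (arrival - first) * factor, runtime, procs)
            for arrival, runtime, procs in jobs]


def scale_runtimes(jobs, factor):
    """Multiply every runtime by `factor`."""
    return [(arrival, runtime * factor, procs) for arrival, runtime, procs in jobs]


jobs = [(0, 100, 4), (50, 200, 8), (200, 50, 16)]
base = offered_load(jobs, machine_size=32)
condensed = offered_load(scale_inter_arrivals(jobs, 0.5), machine_size=32)
stretched = offered_load(scale_runtimes(jobs, 2.0), machine_size=32)
```

Halving the inter-arrival times and doubling the runtimes both double the load in this toy example, but each does so by rescaling the whole distribution of a single field, which is the side effect noted above.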

Our choice, of which field to alter, should be derived from the correlations between 
the runtime load and the three variables that are candidates to change it. We would 
choose lowering inter-arrival times if it were negatively correlated with load, and 
raising runtimes or parallelism if they were positively correlated with it. By doing so, 
we minimize the side effects of raising the load on other features of the workload. 

Using this logical criterion, we have mostly bad news. First, from Figure 1 it is 
clear that systems with a higher average load have a higher inter-arrival time median, 
not a lower one. Second, the runtimes of jobs are not correlated with the load. And third - 
this is the only optimistic part - the degree of parallelism is indeed positively 
correlated with load, but the correlation is far from full. 

This means that a correct way to raise a system's load would end up with higher 
inter-arrival times, about the same runtimes, and somewhat more parallelism. None of 
the three simplistic ways to alter the load satisfies these conditions - they all contradict 
them. Correctly varying a given workload model's load is not as simple as it looks; it 
seems to require changes to the distributions of the inter-arrivals, runtimes, and 
parallelism that the variables we chose for our analysis do not expose. 

Such findings, in this case about the right method to change a workload's average 
load, call for a generalization to a methodological rule. Since most modeled variables 
are correlated to each other, any assumption of the kind "in order to change X I'll 
change Y, and everything else will remain the same" is bound to be wrong. 



9. Self-Similarity 

Recent studies of traces of Ethernet network traffic [17], web server traffic [25], 
and file system requests [26] have revealed an unexpected property of these traces, 
namely that they are self-similar in nature. Intuitively, self-similar stochastic 
processes look similar and bursty across all time scales. Physical limitations, such as 
the finite bandwidth and lifetime of any network or server, inhibit true self-similar 
behavior, but the presence of self-similarity over considerably long amounts of time 
(months, in the case of multi-computers), makes this phenomenon of practical 
importance. 

The major consequences of self-similarity are in the area of scheduling: The 
common theory of workload modeling usually assumes that while a single resource 
may exceed its resource consumption average, its aggregate resource requirements 
over time will have a low variance. In a self-similar system, this is not the case. 
Therefore, the discovery of self-similarity in parallel workloads should lead to a 
reassessment of known schedulers, to determine their compatibility with long-range 
dependence and highly variant workloads. 

We have tested both the production and synthetic workloads for self-similarity 
using three different methods. A description of these methods - R/S analysis, 
Variance-time plots and Periodogram analysis - is given in the appendix, together 
with a formal definition of self-similarity. For a more thorough theoretical 
presentation of the issue, see [27]. 





              Used Processors   Run Time          Total CPU Time    Inter-Arrival Time
              R/S   V-T   Per.  R/S   V-T   Per.  R/S   V-T   Per.  R/S   V-T   Per.
Variable      rp    vp    pp    rr    vr    pr    rc    vc    pc    ri    vi    pi
CTC           0.71  0.71  0.68  0.55  0.75  0.76  0.29  0.65  0.56  0.42  0.63  0.68
KTH           0.74  0.87  0.67  0.68  0.58  0.79  0.61  0.67  0.56  0.48  0.69  0.71
LANL          0.60  0.90  0.82  0.74  0.90  0.77  0.65  0.88  0.76  0.67  0.91  0.68
LANLi         0.96  0.81  0.91  0.80  0.80  0.84  0.71  0.79  0.70  0.86  0.59  0.84
LANLb         0.52  0.78  0.78  0.66  0.81  0.71  0.68  0.80  0.71  0.71  0.79  0.66
LLNL          0.84  0.74  0.84  0.88  0.74  0.69  0.77  0.69  0.72  0.56  0.43  0.71
NASA          0.61  0.68  0.84  0.53  0.66  0.56  0.43  0.60  0.55  0.60  0.35  0.51
SDSC          0.50  0.77  0.68  0.54  0.85  0.70  0.53  0.83  0.60  0.66  0.96  0.67
SDSCi         0.61  0.59  0.94  0.83  0.61  0.58  0.62  0.59  0.56  0.80  0.74  0.64
SDSCb         0.68  0.83  0.72  0.84  0.76  0.68  0.83  0.79  0.58  0.82  0.84  0.56
Lublin        0.47  0.47  0.48  0.55  0.80  0.67  0.55  0.80  0.67  0.45  0.49  0.47
Feitelson 97  0.64  0.62  0.80  0.72  0.62  0.72  0.67  0.58  0.70  0.49  0.49  0.54
Feitelson 96  0.72  0.57  0.65  0.26  0.61  0.69  0.26  0.60  0.68  0.55  0.48  0.50
Downey        0.46  0.49  0.50  0.54  0.48  0.49  0.60  0.47  0.49  0.55  0.46  0.49
Jann          0.69  0.57  0.59  0.49  0.49  0.49  0.64  0.51  0.51  0.61  0.50  0.54



Table 3: Estimations of Self-Similarity 

The results are summarized in Table 3. The value given for each test and each 
variable is the Hurst parameter estimate for that variable by the test. In short, the 
Hurst parameter measures the degree of self-similarity in a time series: It is 0.5 if the 
series is not self-similar, and is closer to 1.0 the more self-similar it is. Although all 
three tests are only approximations and do not give confidence intervals for the value 
of the Hurst parameter, it seems certain that most workloads are self-similar in all of 
the tested variables. This new result adds to similar findings in networks and file 
systems, and hints that most “human generated” workloads, in whose creation tens of 
people or more are involved, will exhibit self-similarity to some degree. 

As self-similarity was unknown when the synthetic models we use were created, it 
is plausible that they will not exhibit the phenomenon. We ran Co-plot on Table 3, 
without any of the variables used in the earlier sections. Adding the previous group of 
variables resulted in a high degree of alienation and low correlations for many 
variables, in both variable groups. This happens because of the two-dimensional 
nature of Co-plot: When using too many variables that are not well correlated, two 
dimensions are just not enough to present a coherent picture. It is then necessary to 
use fewer variables, or split the available variables into groups that belong to different 
“explanation universes”. 

Figure 5 does not include three estimators out of the twelve given in Table 3: The 
R/S analysis estimation of the degree of parallelism and of the total used CPU time, 
and the periodogram estimator of the total used CPU time. These variables were 
removed because of relatively low correlations; however, they also pointed toward the 
left of the map, as do all the other variables. All estimators of the degree of 
parallelism are equal by definition of self-similarity for the absolute and normalized 
versions, so there was no need to test both variables separately. 

There are two clear results. First, all the production workloads except for NASA 
Ames show some degree of self-similarity, while none of the synthetic models do. Uri 
Lublin’s model, in the upper right corner, is apart from the other models, but this is 
not because of high Hurst parameter estimates but because of very low ones, which 
goes against what is seen in the production logs. The Feitelson ’97 model has the 
highest self-similarity, possibly due to the inclusion of repeated job executions. In 
general, it is clear that the models do not capture the self-similar nature of parallel 
computer workloads. 

Second, the three different estimators for the Hurst parameter differ considerably. 
We expected the three estimators of each variable to be highly correlated, 
but this happens only occasionally. For example, while the Variance-Time and 
Periodogram estimators for run time self-similarity are highly correlated, the R/S 
analysis estimator for the same variable is uncorrelated with them. The two estimators 
for inter-arrival time are almost uncorrelated, and the same goes for the degree of 
parallelism. 

Therefore, it is best to refrain from conclusions in the spirit of “workload X has 
high run time self-similarity but low inter-arrival self-similarity”. The only conclusion 
that is supported by all estimators is the fact that the production workloads are self- 
similar while the synthetic models are not, because all the arrows point leftwards - 
where the production workloads are. 

Although further checking is required, it seems that computers with similar 
attributes produce similar self-similarity levels. For example, the CTC and KTH logs, 
which both come from SP2 machines scheduled with EASY, are very close to one another, 
and the batch jobs of LANL and SDSC are also neighbors. We defer this issue for 
now, since finding out the definite causes of self-similarity requires more workloads, 
particularly ones coming from similar computers. 



10. Conclusions and Future Research 

This paper makes two important contributions. First, it presents the Co-plot 
technique, a multivariate statistical method tailored to the demands of the workload 
modeling field: It works well given few observations and many variables, dependent 
or not. Second, it provides new insights about the majority of production workloads 
and synthetic models available today, giving a clear view of what needs to be done 
next. 

One way forward is to find more variables that correlate well with the ones 
already found, and are known for the modeled system. The major problem with the 
parametric model approach suggested in section 8 is the need to estimate the medians 
of the degree of parallelism and inter-arrival times. We wish to model a future system, 
so these are not known in advance. They should be replaced by known variables - we 
have tried the processor allocation scheme, scheduler, number of processors in the 
system, and the expected number of users. We intend to look further into robust 
estimators of the third moment, user or multi-class modeling attributes [2], and the 
self-similarity of the distributions [17]. 

A second path to be taken should try to estimate these unknown medians of a 
future system based on similar systems from the past. As we have seen, this approach 
seems to work in some cases but breaks down in others, and it remains to be 
discovered which changes made to a system - adding processors, replacing the 
scheduler, changing policies and so forth - cause which changes to the resulting 
workload. 

A third question raised is the issue of changing the load of a workload. It was 
shown that the currently used techniques cause harmful (in the modeling sense) side 
effects to the workload, by contradicting the expected correlations between the altered 
variables. Finding the right way to control load is of practical concern to many 
experiments and statistical tests in this field of research. 

Self-similarity is expected to play a significant role in future synthetic models, not 
only in the area of parallel computer workloads. The lack of a suitable model that 
represents self-similarity is apparent, and a new model will be needed in the near future. 
However, although it is clear that none of the models exhibit self-similarity, the effect 
of this absence has not yet been determined, and this needs to be done as well. 

We'd be more than happy to share the data sets and tools we used. The production 
workloads in standard workload format and the source code of the synthetic models are 
available from the online archive at http://www.cs.huji.ac.il/labs/parallel/workload. 
The Co-Plot program and workload analysis program (both under Windows) are 
available from the authors. 



Appendix: Theory of Self-Similarity 



This section briefly describes the theory behind self-similarity, and three methods 
for finding it in a given time series. For a thorough introduction, see [27]. 

Consider a stochastic process $X = (X_1, X_2, \ldots)$ with mean $\mu = E[X_i]$, 
variance $\sigma^2 = E[(X_i - \mu)^2]$, and auto-correlation function: 

$$r(k) = \frac{E[(X_i - \mu)(X_{i+k} - \mu)]}{E[(X_i - \mu)^2]}, \qquad k \ge 0 \tag{5}$$
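
For a finite trace, the auto-correlation function of equation 5 can only be estimated. The sketch below shows one common sample estimator; the function name and the choice of dividing both sums by the series length are our own assumptions, not part of the definition.

```python
def autocorrelation(series, lag):
    """Sample estimate of r(lag): lagged covariance divided by the variance."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((value - mean) ** 2 for value in series) / n
    covariance = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    ) / n
    return covariance / variance


alternating = [1.0, -1.0] * 50          # a strongly anti-correlated toy series
r0 = autocorrelation(alternating, 0)    # always 1 by construction
r1 = autocorrelation(alternating, 1)    # close to -1 for this series
```

Long-range dependence shows up as values of `autocorrelation(series, lag)` that decay slowly, roughly as a power of the lag, instead of dropping quickly to zero.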



The process is said to exhibit long-range dependence if: 



$$r(k) \sim k^{-\beta} L(t) \quad \text{for some } 0 < \beta < 2, \text{ as } k \to \infty \tag{6}$$

where L(t) is a slowly varying function with the property: 

$$\lim_{t \to \infty} \frac{L(tx)}{L(t)} = 1 \qquad \forall x > 0 \tag{7}$$



Self-similar processes possess long-range dependence, but also satisfy stronger 
constraints on the form of the auto-correlation function, to be defined now. For a time 
series $\{X\}$, a new aggregated series $X^{(m)} = (X_k^{(m)} : k = 1, 2, \ldots)$ for 
each $m = 1, 2, 3, \ldots$ is obtained by averaging non-overlapping blocks of size $m$ 
from the original series: 

$$X_k^{(m)} = \frac{X_{km-m+1} + \cdots + X_{km}}{m}, \qquad k \ge 1 \tag{8}$$



The process X is said to be exactly second-order self-similar if there exists 
$0 < \beta < 2$ such that the following two conditions hold: 

$$\mathrm{Var}(X^{(m)}) \propto m^{-\beta} \quad \text{for all } m = 1, 2, 3, \ldots$$

$$r^{(m)}(k) = r(k) \quad \text{for all } k = 1, 2, 3, \ldots \tag{9}$$

The process is said to be asymptotically second-order self-similar if the following 
weaker conditions hold instead (note that only the second condition changes): 

$$\mathrm{Var}(X^{(m)}) \propto m^{-\beta} \quad \text{for all } m = 1, 2, 3, \ldots$$

$$r^{(m)}(k) \to r(k) \quad \text{as } m \to \infty \tag{10}$$



For historic reasons, self-similar processes are characterized by their Hurst 
parameter, defined: 

$$H = 1 - \frac{\beta}{2} \tag{11}$$



The rescaled adjusted range (R/S) statistic for a series X having average A(n) and 
sample variance $S^2(n)$ is given by: 

$$\frac{R(n)}{S(n)} = \frac{1}{S(n)} \left[ \max(0, W_1, \ldots, W_n) - \min(0, W_1, \ldots, W_n) \right] \tag{12}$$

where: 

$$W_k = (X_1 + X_2 + \cdots + X_k) - k A(n) \qquad (k \ge 1) \tag{13}$$

Short-range dependent observations seem to satisfy $E[R(n)/S(n)] = C_0 n^{0.5}$, while 
long-range dependent data, such as self-similar processes, are observed to follow: 

$$E[R(n)/S(n)] = C_0 n^{H} \qquad (0 < H < 1) \tag{14}$$

This is known as the Hurst Effect, and can be used to differentiate self-similar 
from non-self-similar processes. In general, the Hurst parameter can be in one of three 
categories: H < 0.5, H = 0.5 and H > 0.5. When H = 0.5 a random walk is produced, 
and no long-range dependence is observed. When H > 0.5, the values produced are 
self-similar with positive correlation, or persistent; when H < 0.5 the values are self- 
similar with negative correlation, also called anti-persistent. Most observed self- 
similar data to date is persistent. 



Equation 14 can be used to produce an estimate of the Hurst parameter of a given 
trace, observing that the following is derived from it: 

$$\log[R(n)/S(n)] = C_1 + H \log(n) \tag{15}$$

Plotting log[R(n)/S(n)] against log(n) for increasing values of n should therefore result 
in a roughly straight line, with a slope equal to the estimated Hurst parameter. Such a 
graph is called a Pox Plot, and the technique is called R/S analysis. 

Recall from their definition that self-similar processes must satisfy: 

$$\mathrm{Var}(X^{(m)}) \propto m^{-\beta} \quad \text{for all } m = 1, 2, 3, \ldots \tag{16}$$

Taking the logarithm of both sides of the equation gives: 

$$\log[\mathrm{Var}(X^{(m)})] = C_2 - \beta \log(m) \tag{17}$$

Again, plotting $\log(\mathrm{Var}(X^{(m)}))$ against log(m) for a self-similar process should 
result in a linear series of points with a slope of $-\beta$. The Hurst parameter estimate is 
$H = 1 - \beta/2$, therefore a slope between -1 and 0 indicates self-similarity (0.5 < H < 1). 
This plot is known as a Variance-Time Plot. 



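Finally, the Periodogram is a statistical method to discover cycles in time series. 
The periodogram of a time series $\{X_k\}_{k = 1, \ldots, N}$ for a given frequency 
$-\pi < \omega < \pi$ is defined as: 

$$\mathrm{Per}(\omega) = \frac{1}{N} \times \left[ \left( \sum_{k=1}^{N} X_k \cos(\omega k) \right)^2 + \left( \sum_{k=1}^{N} X_k \sin(\omega k) \right)^2 \right] \tag{18}$$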



The periodogram graph of a time series is computed by plotting a log-log graph of 
$\mathrm{Per}(\omega_i)$ against the following frequencies: 

$$\omega_i = \frac{2 \pi i}{N}, \qquad i = 0 \ldots N \tag{19}$$

For a self-similar time series, the periodogram is a straight line with slope 
$\beta - 1 = 1 - 2H$ close to the origin. 



Abstract. The evaluation of parallel job schedulers hinges on the workloads 
used. It is suggested that this be standardized, in terms of both 
format and content, so as to ease the evaluation and comparison of different 
systems. The question remains whether this can encompass both 
traditional parallel systems and metacomputing systems. 

This paper is based on a panel on this subject that was held at the 
workshop, and the ensuing discussion; its authors are both the panel 
members and participants from the audience. Naturally, not all of us 
agree with all the opinions expressed here... 



1 Introduction 

1.1 Motivation 

The study and design of computer systems requires good models of the workload 
to which these systems are subjected, because the workload has a large effect 
on the observed performance. This need was recognized long ago [26,1], and in 
several fields workload data was indeed collected, analyzed, and modeled. Well-known 
examples are address traces used to analyze processor cache performance 
[56,59], and records of file system activity used to motivate the use of file caching 
[49]. Recently we are witnessing a large increase in such activity, with data being 
collected relating to LAN traffic [45], web server loads [3], and video streams [43]. 

This new wave of collecting and analyzing data for use in evaluations is 
also present in the field of job scheduling on high-performance systems. Two 
approaches can be identified. One is to collect the data, describe it [21,60,37], 
and use it directly as input for future evaluations. This has the benefit of being 
considered completely realistic, but also suffers from various methodological 
concerns such as the danger that the data reflects local constraints rather than 
general principles [41,36]. The other approach is to use the data as a reference 
in designing workload models that are used to drive the evaluation. By selecting 
only invariants found in several data sets for inclusion in the model, the 
confidence in the model is improved [18,16]. 

A problem that remains is that too many workloads are now available, be they 
naive models based on guesswork, complex models based on measurements, or 
the measurements themselves. Faithful comparisons of different schemes require 
a representative set of workloads to be canonized as a benchmark, and used by all 
subsequent studies. The definition of a standard benchmark should include both 
the benchmark data (or a program to generate it), and its format, to enable 
efficient and easy use. Our goal in this paper is to explore the possibility of 
creating such a standard. 



1.2 Scope 

Application scheduling versus job scheduling. Benchmarks are only useful 
if they sufficiently represent their target community. For instance, SPEC 
benchmarks have been carefully selected to cover a wide range of different 
applications. Similarly, benchmarks for the evaluation of parallel job schedulers must 
be based on the applications typically run on those parallel machines. Using a 
slightly simplified view we can distinguish two classes for these applications: 

— Rigid applications^1 which are fine tuned for a specific parallel machine and 
configuration. The most common examples are programs written in the message 
passing paradigm, where all communication between the processors is 
carefully arranged to achieve a large degree of latency hiding. Such programs 
cannot cope with situations where the number of processors is reduced even 
by one during the execution, and there is also no benefit from assigning 
additional processors, as they will remain unused. 

— Flexible applications^2 which can be run on a variety of different machine 
configurations. Typically, a high degree of efficiency can only be achieved for 
these jobs if they are made adaptable to the actual configuration. Therefore, 
they frequently consist of a large number of interdependent modules for 
which a suitable schedule must be generated. A simple approach is to use a 
master-workers structure. 

^1 This includes moldable applications [25] which are written so that they can run on 
different numbers of processors as chosen when the job starts execution; the point is 
that the job cannot change during execution, so there is no application scheduler. 

Based on these two application classes it is also appropriate to distinguish 
two types of schedulers: machine schedulers and application schedulers. Machine 
schedulers for large parallel machines are, naturally, machine-centric. They 
typically do not look much inside a job. As input they receive characteristic data 
from a stream of independent jobs. Computing resources, like processors, memory, 
or I/O facilities, are allocated to these jobs with the goal of optimizing the 
value of the actual scheduling objective function. Therefore, machine schedulers 
try to keep the number of unassigned resources at a minimum while load balancing 
within a job is up to the owner of the job. Machine schedulers must deal 
with the on-line character of job submission and with a potential inaccuracy of 
job submission data, like the estimated execution time of a job. On the other 
hand they need not consider dependences between the submitted jobs. The 
performance of a machine scheduler may be highly dependent on the workload and 
possibly on the given objective function. Having a representative workload may 
therefore allow the administrator of a parallel machine to determine the scheduler 
best suited to that machine. Hence, those administrators can be assisted by a set of 
benchmarks that cover most workloads occurring in practice. 

Application schedulers, on the other hand, arrange the modules of flexible 
applications to make best use of the currently available resources. They do not 
consider other independent jobs running concurrently on the same machine. 
Therefore, they are application-centric. Typically, it is their goal to minimize 
the overall execution time of their applications. To this end they must consider 
the dependences between the various modules of their applications. All modules 
are known to the schedulers up front. While quite a few different algorithms for 
application schedulers have been suggested it is not clear whether their performance 
varies significantly for different applications. It may therefore be possible 
to evaluate application schedulers with the help of a generic application model. 
In this case benchmarks for application schedulers are not needed. 

But if application schedulers start to proliferate they may significantly influence 
the workload characteristics of parallel machines, changing the workload from being 
predominantly rigid to mostly flexible. It is also possible that machine schedulers 
and application schedulers may cooperate in the future to make best use of the 
available resources. The state of the art in workload benchmarking for rigid jobs, 
and questions about extending it to flexible jobs, are discussed in Section 2. 



^2 This is taken to include both malleable and evolving jobs in the terminology of [25]. 

Steve J. Chapin et al. 

Scheduling for metacomputing and its requirements. A recent area of 
research is how to collect resources from many organizations into entities called 
metasystems or computational grids [28]. A metasystem consists of computers, 
networks, databases, instruments, visualization devices, and other types of 
resources owned by different organizations and located around the world. In 
addition to these resources, a metasystem contains software that people use to 
access it. There are several projects that provide such software [27,33,46,6] and, 
among many other things, this software supports meta schedulers: schedulers 
that help users select what resources to use for an application and help users to 
execute their application on those resources. 

While there are many types of meta schedulers, they often have several common 
requirements. First, a user or meta scheduler has a larger and more diverse 
set of resources to pick from than those present in a single supercomputer. A 
meta scheduler therefore needs information about resources and applications to 
determine which resources to select for an application. A meta scheduler needs 
to know when resources are available, what they cost, which users have access to 
them, how an application performs on them, etc. Information on current availability 
of resources is easily available and there is ongoing work on predicting 
the future availability of network bandwidth [61] and when a scheduler will start 
applications [57,14]. Prediction of application performance on various sets of 
resources is also being investigated [6]. Even though this information is becoming 
available, an additional need is a common way to gain access to this information, 
such as the Metacomputing Directory Service provided by the Globus [27] 
software. 

In addition to the new types of information described above, many meta 
schedulers need resources from more than one source — similar to the idea of 
gang scheduling on parallel machines [22]. This requires mechanisms for gaining
simultaneous access to resources. One such mechanism is reserving resources 
at some future time. Mechanisms for network quality of service [29] allow such 
reservation of networking resources and reservation mechanisms are currently 
being added to scheduling systems for parallel computers [54]. 

The issues of benchmarking the application schedulers for metacomputing
are discussed in Section 3, and the relationship between scheduling on parallel
systems and metasystems is examined in Section 4.



Possible inclusion of the objective function The measured performance of 
a system depends not only on the system and workload, but also on the metrics 
used to gauge performance. It is these metrics that serve as the objective func- 
tion of the scheduler, whose goal is to optimize their value. For some objective 
functions, such as utilization and throughput, the goal is to maximize; for others, 
such as response time or slowdown, the goal is to minimize. 

The problem is that measurement using different metrics may lead to conflicting
results. For example, one of the papers in the workshop showed contradictory
results for the comparison of two scheduling algorithms, depending on whether
response time or slowdown was used as the metric [31]. Another paper [42]
specifically addressed the issue of deriving objective functions tailored to a set
of owner-defined policy rules. This paper also showed significant differences in
the ranking of various
scheduling algorithms if applied to objective functions that only differ in the
selection of a weight. It may therefore be appropriate to standardize the objec- 
tive functions that are used, in order to enable a truthful comparison between 
different studies. However, this is only appropriate if a large number of differ- 
ent objective functions are used in practice and if machine schedulers produce 
significantly different results for those different objective functions. Currently, 
only a few standard objective functions — like the average response time or the 
machine utilization — can be found in almost all installations. However, it is 
not clear whether this small number is due to a missing concept for generating 
objective functions that are better tailored to the rules of the owners of parallel 
machines. 
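
The way two metrics can disagree may be illustrated with a small sketch. The wait and run times below are invented for illustration and do not come from [31] or [42]; the schedulers "A" and "B" are hypothetical. Response time is taken as wait time plus run time, and slowdown as response time divided by run time.

```python
from statistics import mean

def response_time(wait, run):
    """Response time: time from submittal to termination."""
    return wait + run

def slowdown(wait, run):
    """Slowdown: response time normalized by the job's run time."""
    return (wait + run) / run

# Hypothetical (wait, run) pairs in seconds for the same two jobs, one long
# and one short, as they might fare under two invented schedulers A and B.
schedule_a = [(0, 1000), (100, 10)]   # long job starts at once, short job waits
schedule_b = [(150, 1000), (0, 10)]   # short job starts at once, long job waits

def summarize(schedule):
    """Return (mean response time, mean slowdown) for a list of jobs."""
    return (mean(response_time(w, r) for w, r in schedule),
            mean(slowdown(w, r) for w, r in schedule))

resp_a, slow_a = summarize(schedule_a)
resp_b, slow_b = summarize(schedule_b)

# A wins on mean response time, while B wins on mean slowdown.
print("A better by response time:", resp_a < resp_b)
print("B better by slowdown:", slow_b < slow_a)
```

Slowdown weights short jobs heavily, so a scheduler that favors them can look better under slowdown and worse under response time.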

In this paper we do not discuss this issue further. We just note that further 
research into the relative merits of different metrics is needed [24].

2 Workload Benchmarks for Parallel Systems 

A mere five years ago practically no real data about production workloads on 
parallel machines was available, so evaluations had to rely on guesswork. This 
situation has changed dramatically, and now practically all evaluations of parallel 
job schedulers rely on real data, at least to some degree. While more details can 
always be added, the time seems ripe to start talking about standardization of 
workload benchmark data. 

2.1 State of the Art 

A large amount of data on production parallel supercomputers has been collected 
in the Parallel Workloads Archive [19]. This includes both raw logs and derived 
models. 

Workload logs Most parallel supercomputers maintain accounting logs for 
administrative use. These logs contain valuable information about all the activity 
on the machine, and in particular, about the attributes of each job that was 
executed. The format of the logs is typically an ASCII file with one line per job 
(although some systems maintain a much more detailed log). Analyzing such 
logs can lead to important insights into the workload. Such work has been done 
for some systems, including the NASA Ames iPSC/860 [21], the SDSC Paragon 
[60], the CTC SP2 [37], and the LANL CM-5 [17]. 

While most logs contain the same core data about each job (such as the 
submittal, start, and end times, the number of processors used, and the user 
ID), there are other less-standard fields as well. Some systems contain data 
about resource requests made before the job started. Some contain data about 
additional resources such as memory usage. Some contain internal data about 
the queue to which the job was submitted, and prioritization parameters used by 
the scheduler. Moreover, these fields appear in different orders and formats. The 
standard format suggested below attempts to accommodate all the important 
and useful fields, even if they do not appear in every log. 

Workload models Workload models are based on some statistical analysis
of workload logs, with the goal of elucidating their underlying principles. This 
then enables the creation of new workloads that are statistically similar to the 
observations, but can also be changed at will (e.g. to modify the system load) [16]. 

The most salient feature of workload models is that they include exactly what 
the modeler puts into them. This is both an advantage and a disadvantage. It 
is an advantage because the modeler knows about all the features of the model, 
and can control them. It is a disadvantage because real workloads may contain 
additional features that are unknown, and therefore not included in the models. 
As the effect of various workload features is typically not known in advance, it 
is prudent to at least include as many known workload features as possible. 

Current workload models fall into two categories: those of rigid jobs, and 
those of flexible jobs. Rigid job models create a sequence of jobs with given 
arrival time, number of processors, and runtime (e.g. [18,39,47]). The task of the 
scheduler is then to pack these “rectangular” jobs onto the machine. Given the 
relative simplicity of rigid jobs, a number of rather advanced models have been 
designed. A statistical analysis [58] shows that the one proposed by Lublin [47] 
is relatively representative of multiple workloads. 

Flexible job models attempt to describe how an application would perform 
with different resource allocations, and maybe even how it would perform if the 
resources are changed at runtime. One way to do this is to provide data about 
the total computation and the speedup function [55,13], instead of the required 
number of processors and runtime. This enables the scheduler to choose the 
number of processors that will be used, according to the current load conditions. 
Another approach is to provide an explicit model of the internal structure of 
the application [7,24]. This allows for a detailed simulation of the interactions 
between the scheduling and the application, leading to better evaluations at the 
cost of more complex simulation. While several models have been proposed, there 
is still insufficient data about the relative distribution of applications with differ- 
ent speedup characteristics and internal structures to allow for any statements 
regarding which is more representative. 
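
As a minimal sketch of the first approach, a flexible job can be described by its total computation and a speedup function, from which the runtime on any allocation follows. The class and function names are hypothetical, and the Amdahl-style speedup curve is only a stand-in assumption; it is not the speedup model of [55] or [13].

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FlexibleJob:
    total_work: float                 # seconds of computation on one processor
    speedup: Callable[[int], float]   # speedup achieved on n processors

    def runtime(self, n: int) -> float:
        """Runtime if the scheduler allocates n processors."""
        return self.total_work / self.speedup(n)

def amdahl(serial_fraction: float) -> Callable[[int], float]:
    """Assumed speedup curve: Amdahl's law with the given serial fraction."""
    def s(n: int) -> float:
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)
    return s

# The scheduler, not the user, picks the allocation according to the load.
job = FlexibleJob(total_work=3600.0, speedup=amdahl(0.1))
for n in (1, 4, 16):
    print(n, "processors:", round(job.runtime(n), 1), "seconds")
```

With such a description, a simulator can evaluate the same job under different allocations instead of replaying a fixed "rectangle".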



2.2 Future Work 

Workload models may be improved in three main ways: by including additional 
resources, such as memory and I/O, by including feedback, and by including the 
internal structure of parallel programs. In addition, the evaluation of schedulers 
will benefit from data about outages that schedulers have to deal with. 



Including memory requirements and I/O Current workload models con- 
centrate on one type of resource: computing power. However, in reality, jobs 
require other resources as well, and the interaction between the demands for 
different resources can have a large effect on possible schedules. 

One resource that has received some attention is memory. Several papers
acknowledge the importance of memory requirements and their effect on
scheduling [2,51,50]. However, there is very little data about actual memory usage
patterns [17], and this has so far not been incorporated in any workload model.
Moreover, it is necessary to model not only the total amount of memory that is 
used, but also the degree of locality with which it is accessed, as this has a great 
impact on the amount of memory that has to be allocated in practice [4]. 

Another important characteristic that has a significant impact on schedul- 
ing is I/O activity. The Charisma project has collected some data on the I/O 
behavior of parallel programs [48]², but this has only been used for the design
of parallel file system interfaces. We are only beginning to see considerations of 
I/O in scheduling work [44,53], but this is so far not based on much real data. 
As real applications obviously do perform I/O (and sometimes even a lot of it), 
this is a severe deficiency in current practice. 

For both memory and I/O, we do not have enough data yet for contemplating 
a standard benchmark, at least not one that is known to be representative and 
is based on measurements. 



Including feedback Another problem with current workload models is the lack 
of feedback. The observed workload on a production machine is not created by 
random sampling from a population of programs. Rather, it is the result of inter- 
leaving the sequences of activities performed by many human beings. Activities 
in such sequences are often dependent on each other: you first edit your pro- 
gram, then compile it, and then execute it; you change parameters and execute 
it again after observing the results of the previous execution. Thus the instant 
at which a job is submitted to the system may depend on the termination of a 
previous job. As the time of the previous termination depends on the system’s 
performance, so does the next arrival. In a nutshell, there is a feedback effect 
from the system’s performance to the workload. 

The realization that such feedback exists is not new. In fact, feedback has 
been included explicitly in some queueing studies, especially those employing 
closed queueing networks with a delay center representing user think time in the 
feedback loop (see, e.g., [38]). However, this practice has so far not extended to 
performance analysis based on observed workloads, because it does not appear 
explicitly in the observations. Accounting logs do not include explicit informa- 
tion about feedback, so this effect is lost when a log is replayed and used in an 
evaluation. However, it is possible to make educated guesses in order to insert 
postulated dependencies into an existing log. The methodology is straightforward:
we identify sequences of dependent jobs (e.g. all those submitted by the
same user in rapid succession), and replace the absolute arrival times of jobs in 
the sequence with interarrival times relative to the previous job in the sequence. 
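
The methodology above can be sketched as follows. This is only one possible heuristic, under stated assumptions: "rapid succession" is taken to mean that a job was submitted at most `max_think` seconds after the termination of the same user's previous job, and the `Job` record and the 600-second threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Job:
    number: int
    user: int
    submit: int                       # seconds since the start of the log
    wait: int
    run: int
    preceding: Optional[int] = None   # job this one is postulated to depend on
    think: Optional[int] = None       # seconds after that job's termination

def insert_dependencies(jobs, max_think=600):
    """Link each job to the same user's previous job when it was submitted
    within max_think seconds after that job terminated (a guessed rule)."""
    last_by_user = {}
    for job in sorted(jobs, key=lambda j: j.submit):
        prev = last_by_user.get(job.user)
        if prev is not None:
            gap = job.submit - (prev.submit + prev.wait + prev.run)
            if 0 <= gap <= max_think:
                job.preceding, job.think = prev.number, gap
        last_by_user[job.user] = job
    return jobs

log = [
    Job(1, user=1, submit=0, wait=5, run=60),
    Job(2, user=2, submit=10, wait=0, run=30),
    Job(3, user=1, submit=75, wait=0, run=120),    # 10 s after job 1 ended
    Job(4, user=2, submit=5000, wait=0, run=30),   # too late to be linked
]
insert_dependencies(log)
for job in log:
    print(job.number, job.preceding, job.think)
```

When the log is replayed, a linked job is submitted relative to the simulated termination of its predecessor rather than at its logged absolute time.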



Including the internal job structure The feedback noted above is between 
the system and the user, and may affect the arrival process. There is also a 

² A historical note: the Charisma data actually triggered the first study of a
production parallel workload in [21].
possibility of feedback between the system and the parallel job itself. Specifically,
the synchronization and communication patterns of the application may have
various performance implications that depend on how the application’s processes
are scheduled to different processors [35,23]. 

For example, earlier work in the SIGMETRICS community compared space slicing
with time slicing. Two orthogonal issues were allocation of processing power
among jobs and support for interprocess synchronization (IPS). The space slicing
work recognized the importance of processing power allocation and developed
dynamic and/or adaptive algorithms. Some of the algorithms necessitated fairly
complicated mechanisms to ensure that processor allocations could be changed
without hurting interprocess synchronization. If synchronization is frequent, then
either gang scheduling or IPS-cognizant space slicing mechanisms are needed, but
if IPS is typically coarse grained they may be unnecessary. Even when they are
necessary, IPS may be coarse grained enough under gang scheduling that
fragments could be used as alternates, rather than requiring that complete
gangs be coscheduled.

In last year’s introductory paper we presented a strawman proposal of how 
the internal structure of a parallel application can be summarized by a small 
number of parameters [24]. The main parameters were the number of processors, 
the number of barriers, the granularity, and the variance of these attributes. 
While this cannot capture the full spectrum of possible parallel applications, it 
is expected to provide enough flexibility in order to create a varied workload that 
will exercise the interactions between applications and the scheduler in various 
ways. 

The problem with including internal structure in the workload benchmark 
is the complete lack of knowledge about what parameter values to use. This in- 
formation could be collected by augmenting a library providing synchronization 
facilities to trace this information (as was done in Charisma for the I/O library). 
This functionality already exists in PVM and Legion, for example. If the library
is a dynamic library then, in theory, it would be easy to take someone’s code
and measure it. Such an undertaking has to be done at a large production site,
provided that the measurements do not slow down users’ production-level
codes.

An obvious alternative to modeling the internal structure is to use real ap- 
plications [62,12]. However, the question remains of which applications to use, 
in what mixes, and how to create different sizes. This again boils down to the 
question of how to create a representative workload, and the lack of data about 
the relative popularity of different application types. 



Including outage information While simulations and models are useful for 
comparing different algorithms, in the real world, there are many more variables 
that come into play than the few that are typically used in scheduling models. 
If the purpose of running a new scheduling algorithm through a simulator on a 
real workload is to measure how well that algorithm will work in production on 
a similar workload, then it cannot possibly be accurate if it ignores all factors
external to a scheduler’s trace file. 

Parallel systems have matured considerably over the past decade, but still 
are not as stable or reliable as traditional vector systems like the Cray C90. This 
instability should be taken into consideration when creating a scheduler simula- 
tor. Such factors as node failure, network interruption, disk failure, mean time 
between failure, and length of failures are important variables that a production 
scheduler has to cope with. In a distributed memory system like the IBM SP, it 
is possible for a node to drop offline, but the system continues to operate. Any 
job running on that node would have to be restarted, but this has no effect on
any other running jobs. The system scheduler detects the failed nodes, and takes
action to schedule around the failed hardware. This information, however, is not
recorded in typical job trace files, and is therefore not taken into account during 
the analysis of the traces. 

Another important aspect of system availability is the impact of human- 
generated outages. All production systems are taken down for scheduled main- 
tenance and often for dedicated time. This outage information is often available 
to the job scheduler so that jobs can be scheduled around the outages, or such 
that the system is drained up to the outage. This information does not appear 
in the scheduler trace files, but is needed input for simulators. Most sites col- 
lect outage data, and many archive it for historical comparisons (like NAS). A 
standard format for outage data should be created to complement the scheduling
workload traces. The two datasets should be keyed to each other, and should 
contain the necessary information to accurately predict scheduler behavior in a 
real work environment. 

As a start, we propose that the following information be collected
and reported in a standard format, for every outage that removes any portion
of a system from operation:

— Announced time of outage (e.g. when did the outage info become available 
to the scheduler — was it known in advance, or did the scheduler suddenly 
detect that there were fewer nodes available?) 

— Start time of outage (when the outage actually occurred) 

— End time of outage (when the affected resources were again schedulable) 

— Type of outage (CPU failure, network failure, facility) 

— Number of nodes affected (or perhaps percentage of machine affected — 
for example, a failed scratch file system may prevent only a few users from 
running, but the others can continue.) 

— Specific affected components (which nodes went down, what part of the 
network failed) 
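
The proposed items can be sketched as a simple record, together with the kind of query a simulator might make against it. No outage format has been standardized here, so all field names, type labels, and the helper `available_nodes` are hypothetical, and the helper assumes that concurrent outages affect disjoint sets of nodes.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Outage:
    announced: int          # when the scheduler learned of the outage (seconds)
    start: int              # when the outage actually occurred
    end: int                # when the resources were schedulable again
    kind: str               # e.g. "cpu", "network", "facility"
    nodes_affected: int
    components: List[str] = field(default_factory=list)

    @property
    def advance_notice(self) -> int:
        """Seconds of warning the scheduler had; 0 for a sudden failure."""
        return max(0, self.start - self.announced)

    @property
    def duration(self) -> int:
        return self.end - self.start

def available_nodes(total_nodes, outages, t):
    """Nodes schedulable at time t, assuming outages hit disjoint nodes."""
    down = sum(o.nodes_affected for o in outages if o.start <= t < o.end)
    return max(0, total_nodes - down)

maintenance = Outage(announced=0, start=86400, end=93600,
                     kind="facility", nodes_affected=128)
failure = Outage(announced=5000, start=5000, end=6800, kind="cpu",
                 nodes_affected=2, components=["node17", "node18"])

print(available_nodes(128, [maintenance, failure], 5500))
print(maintenance.advance_notice, failure.advance_notice)
```

The announced time separates outages the scheduler could plan around (draining the machine) from failures it could only react to.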



2.3 A Standard Workload Format 

The goal of the standard format is to help researchers using workloads, either 
real or synthetic. Its main advantages over what is currently available are: 

— Ideas and tests regarding workload models could be easily applied to all
available workloads. This is rarely done because of the need to write scripts 
to handle the different formats of workloads today. 

— The file format is easy to parse and use: while it is a text file (to avoid prob- 
lems with converting data files) all data is in integers (no character strings!), 
so there are no problems with parsing dates or other special entries. This 
provides simplicity and absolute standardization at the expense of general- 
ity and extensibility: you are guaranteed to be able to parse and understand 
every file abiding by the standard, because users cannot add their own new 
fields. 

— Every datum must abide by strict consistency rules which, when checked, ensure
that the workload is always “clean”.

— Data is in standard units. Moreover, users and executables are given by in- 
cremental numbers, which makes their parsing easier, makes grouping by 
users/executables easier, hides administrative issues, and hides sensitive in- 
formation. 

A major design goal was to be able to use the format for both real and 
synthetic workloads. This means that only some of the fields will usually be 
meaningful for any given workload — a synthetic workload may only include in- 
formation about submit times, runtimes, and parallelism, while a real workload 
won’t include any information about scheduler feedback. Therefore, unknown 
values are part of the standard. The fields were chosen so that all information 
from logs we have will be saved except very rare fields (that appeared in only one 
log, for example). For synthetic workloads, future research directions were also 
considered: For example, the format enables expressing the existence of sched- 
uler feedback, which can be generated using a variety of models. The internal 
structure (I/O, barriers, and so forth) of jobs is still not included, since no logs 
and only one model address this issue and the right way of doing it is still unclear.
Future versions of the standard may include additional fields for this and
other purposes. 



The data fields Standard workload files contain one line per job, consisting of
a list of space-separated integers. Missing values are denoted by -1, and all
other values are non-negative. Lines beginning with a semicolon are treated as
comments and ignored. The beginning of every file contains several such lines 
that describe the workload in general. The jobs are numbered consecutively in 
the file. Job IDs from workloads that are converted to the standard format are 
discarded, since they are not always integers and not always unique (if they 
combine data from several years). Each line in the file has these fields, in this 
order: 

1. Job Number — a counter field, starting from 1. 

2. Submit Time — in seconds. The earliest time the log refers to is zero, and 
is the submittal time of the first job. The lines in the log are sorted by
ascending submittal times. 

3. Wait Time — in seconds. The difference between the job’s submit time and
the time at which it actually began to run. Naturally, this is only relevant 
to real logs, not to models. 

4. Run Time — in seconds. The wall clock time the job was running (end time 
minus start time). 

We decided to use “wait time” and “run time” instead of the equivalent 
“start time” and “end time” because they are directly attributable to the 
scheduler and application, and are more suitable for models where only the 
run time is relevant. 

5. Number of Allocated Processors — an integer. In most cases this is also the 
number of processors the job uses; if the job does not use all of them, we 
typically don’t know about it. 

6. Average CPU Time Used — both user and system, in seconds. This is the 
average over all processors of the CPU time used, and may therefore be 
smaller than the wall clock runtime. If a log contains the total CPU time 
used by all the processors, it is divided by the number of allocated processors 
to derive the average. 

7. Used Memory — in kilobytes. This is again the average per processor. 

8. Requested Number of Processors. 

9. Requested Time. This can be either runtime (measured in wallclock seconds), 
or average CPU time per processor (also in seconds) — the exact meaning is 
determined by a header comment. If a log contains a request for total CPU 
time, it is divided by the number of requested processors. 

10. Requested Memory (again kilobytes per processor). 

11. Completed? 1 if the job was completed, 0 if it was killed. This is meaningless 
for models, so would be -1. 

If a log contains information about checkpoints and swapping out of jobs, a job
can have multiple lines in the log. In fact, we propose that the job information 
appear twice. First, there will be one line that summarizes the whole job: its 
submit time is the submit time of the job, its runtime is the sum of all partial 
runtimes, and its code is 0 or 1 according to the completion status of the whole 
job. In addition, there will be separate lines for each instance of partial execution 
between being swapped out. All these lines have the same job ID and appear 
consecutively in the log. Only the first has a submit time; the rest only have a 
wait time since the previous burst. The completion code for all these lines is 2,
meaning “to be continued”; the completion code for the last such line is 3 or 4,
corresponding to completion or being killed. It should be noted that such details 
are only useful for studying the behavior of the logged system, and are not a 
feature of the workload. Such studies should ignore lines with completion codes 
of 0 and 1, and only use lines with 2, 3, and 4. For workload studies, only the 
single-line summary of the job should be used, as identified by a code of 0 or 1. 

12. User ID — a natural number, between one and the number of different users. 

13. Group ID — a natural number, between one and the number of different 
groups. Some systems control resource usage by groups rather than by indi- 
vidual users. 

14. Executable (Application) Number — a natural number, between one and
the number of different applications appearing in the workload. In some logs,
this might represent a script file used to run jobs rather than the executable 
directly; this should be noted in a header comment. 

15. Queue Number — a natural number, between one and the number of different 
queues in the system. The nature of the system’s queues should be explained 
in a header comment. This field is where batch and interactive jobs should 
be differentiated: we suggest the convention of denoting interactive jobs by 
0.

16. Partition Number — a natural number, between one and the number of 
different partitions in the system. The nature of the system’s partitions
should be explained in a header comment. For example, it is possible to use 
partition numbers to identify which machine in a cluster was used. 

17. Preceding Job Number — this is the number of a previous job in the work- 
load, such that the current job can only start after the termination of this 
preceding job. Together with the next field, this allows the workload to include
feedback as described in Section 2.2.

18. Think Time from Preceding Job — this is the number of seconds that should 
elapse between the termination of the preceding job and the submittal of this 
one. 

The last two fields work as follows. Suppose we know that a.out, job number
123, should start ten seconds after the termination of gcc x.c, which is job
number 120. We could give job number 123 a submittal time that is 10 seconds
after the submittal time plus wait time plus run time of job 120, but this
wouldn’t be right: changing the scheduler might change the wait time of job 120
and spoil the
connection. The solution is to use fields 17 and 18 to save such relationships 
between jobs explicitly. In this example, for job number 123 we’ll put 120 in its 
preceding job number field, and 10 in its think time from preceding job field. 
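
A simulator might resolve these two fields as in the following minimal sketch. The dictionary keys and the function name are hypothetical, and treating an unknown think time as zero is an assumption of the sketch, not part of the format.

```python
def effective_submit_time(job, end_times):
    """Submit time of `job` in a simulation.

    `job` is a dict with keys 'submit', 'preceding', and 'think' (fields 2,
    17, and 18; -1 means unknown). `end_times` maps job numbers to the
    termination times produced by the simulated scheduler.
    """
    if job["preceding"] == -1:
        return job["submit"]           # independent job: use the logged time
    think = max(job["think"], 0)       # unknown think time treated as zero
    return end_times[job["preceding"]] + think

job_123 = {"number": 123, "submit": 1010, "preceding": 120, "think": 10}

# If job 120 ends at t=1000 in one simulation and at t=700 in another,
# job 123 arrives ten seconds after whichever termination time applies.
print(effective_submit_time(job_123, {120: 1000}))
print(effective_submit_time(job_123, {120: 700}))
```

The logged submit time is used only for jobs without a preceding job, so the dependency survives a change of scheduler.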



Header Comments The first lines of the log may be of the format ; Label:
Value1, Value2, .... These are special header comments with a fixed format,
used to define global aspects of the workload. Predefined labels are: 

Computer : Brand and model of computer

Installation : Location of installation and machine name

Acknowledge : Name of person(s) to acknowledge for creating/collecting the
workload.

Information : Web site or email that contain more information about the work- 
load or installation. 

Conversion : Name and email of whoever converted the log to the standard 
format. 

Version : Version number of the standard format the file uses. The format 
described here is version 2. 

StartTime : In human readable form, in this standard format: Tuesday, 1 Dec 
1998, 22:00:00 

EndTime : In the same format as StartTime.

MaxNodes : Integer, number of nodes in the computer (describe the sizes of 
partitions in parentheses). 

MaxRuntime : Integer, in seconds. This is the maximum that the system allowed,
and may be larger than any specific job’s runtime in the workload.

MaxMemory : Integer, in kilobytes. Again, this is the maximum the system
allowed.

AllowOveruse : Boolean. ’Yes’ if a job may use more than it requested for any 
resource, ’No’ if it can’t. 

Queues : A verbal description of the system’s queues. Should explain the queue
number field (if it has known values). At a minimum, it should explain
how to tell batch jobs from interactive jobs.

Partitions : A verbal description of the system’s partitions, to explain the par- 
tition number field. For example, partitions can be distinct parallel machines 
in a cluster, or sets of nodes with different attributes (memory configuration, 
number of CPUs, special attached devices), especially if this is known to the 
scheduler. 

Note : There may be several notes, describing special features of the log. For 
example, “The runtime is until the last node was freed; jobs may have freed 
some of their nodes earlier”.
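
To illustrate how little code is needed to read the format described above, here is a minimal parsing sketch. The short field names, the function names, and the sample data are hypothetical. The sketch treats any comment line containing a colon as a header comment, which is a simplification.

```python
FIELDS = [
    "job", "submit", "wait", "run", "procs", "cpu", "mem",
    "req_procs", "req_time", "req_mem", "status", "user", "group",
    "app", "queue", "partition", "preceding", "think",
]

def parse_workload(lines):
    """Return (header, jobs) from an iterable of standard-format lines."""
    header, jobs = {}, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(";"):
            label, sep, value = line[1:].partition(":")
            if sep:   # "; Label: Value" header comment; a label may repeat
                header.setdefault(label.strip(), []).append(value.strip())
            continue
        values = [int(v) for v in line.split()]
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
        # Missing values are denoted by -1; map them to None.
        jobs.append({k: (None if v == -1 else v) for k, v in zip(FIELDS, values)})
    return header, jobs

def workload_jobs(jobs):
    """Keep only whole-job summary lines (code 0 or 1, or unknown)."""
    return [j for j in jobs if j["status"] in (None, 0, 1)]

# Invented sample: job 2 was swapped out once, so it has a summary line
# followed by two partial-execution lines with codes 2 and 3.
sample = """\
; Version: 2
; MaxNodes: 128
; Note: hypothetical data for illustration
1 0 10 100 16 -1 -1 16 -1 -1 1 1 1 1 1 1 -1 -1
2 5 0 75 8 -1 -1 8 -1 -1 1 2 1 2 1 1 -1 -1
2 5 0 40 8 -1 -1 8 -1 -1 2 2 1 2 1 1 -1 -1
2 -1 20 35 8 -1 -1 8 -1 -1 3 2 1 2 1 1 -1 -1
""".splitlines()

header, jobs = parse_workload(sample)
print(header["Version"], len(jobs), len(workload_jobs(jobs)))
```

Because every field is an integer and unknown values are explicit, the parser needs no date handling or special cases beyond the comment prefix.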

3 Workload Benchmarks for Metacomputing 

Most of the resources of a conventional parallel computer are used by batch 
jobs. Therefore, job schedulers are typically not required to provide compute 
resources at a specific time. However, this has changed with the appearance of 
metacomputers. Many metasystems are based on the concept of a single virtual 
machine which can also be used to run large parallel jobs. But this requires the 
availability of compute resources on different machines at the same time. In addition,
network resources may be needed as well. This can only be achieved if the
schedulers that control the participating parallel machines accept reservations. 
Unfortunately, it is not clear how to incorporate resource reservations into present
scheduling algorithms. A simple approach may be an extension of backfilling. In 
the workshop some participants reported promising results with this concept. 
However, this assumes that the best time instant for such a resource reservation 
is already known. In any case, the widespread use of a parallel computer as part 
of a metasystem will certainly affect the workload and may therefore require 
new benchmarks. 



3.1 Scheduling in a Metacomputing Environment 

In the metacomputing scenario, there are many schedulers simultaneously acting 
over the system. Some of these schedulers control the resources they schedule over 
and thus constitute the access point to such resources (i.e., one has to submit 
a request to the scheduler in order to use the resources it controls). On the 
other hand, there are schedulers that do not actually control the resources they
use. Instead they communicate with multiple lower-level schedulers and decide 
which of them should be used, and which part of the parallel computation each 
of them should carry out. Requests to the appropriate low-level schedulers are 
then created and submitted on behalf of the user. 




Fig. 1. Entities involved in scheduling in a metacomputing environment. 



In order to keep the discussion focused, we suggest the following terminol- 
ogy and definitions (which are summarized graphically in Figure 1). We call the 
scheduler that controls a certain machine a machine scheduler; this is typically
the OS scheduler on this machine, especially on desktop machines. On a par- 
allel supercomputer, this may be the parallel operating environment scheduler 
running on the front end, or a batch queueing system such as NQS or PBS 
used to access the machine. Parallel machines may also have node schedulers, 
which control individual nodes, usually according to the directions of the ma- 
chine scheduler (e.g. to implement gang scheduling). These are internal to the 
parallel machine implementation and therefore not relevant in a discussion of 
external workloads. Finally, there are meta-schedulers that interact with several 
machine schedulers in order to find usable resources and use them to schedule 
metacomputing applications. A special case of meta-schedulers are application
schedulers, which are developed in conjunction with a specific application and use
application-specific knowledge to optimize its execution.

In order to decide which machine schedulers to use (and what each of them
should do), the meta-scheduler needs to know how long a given request will take
to be processed on a given machine scheduler, under the current system load.
That is, in order to make reasonable decisions, the meta-scheduler needs
information on how the machine schedulers are going to deal with its requests. Although
some have proposed mechanisms to promote effective communication among the 
different schedulers in the system [11,8], the machine schedulers currently in use 
have not been designed with this need in mind. Therefore, researchers in meta- 
computing have developed tools that monitor and forecast how long a request is 
going to take to run over a particular set of resources (e.g., [61]). 

Today there is no such tool for space-sliced parallel supercomputers. Since 
jobs run on a dedicated set of nodes in these machines, the information meta- 
schedulers can expect to obtain regards the queue waiting time. In principle, 
work on supercomputer queue time prediction [15,57,32] could be used to provide 
this information. However, the results obtained for queue time predictions are 
still relatively inaccurate, making them inadequate for many metacomputing 
applications, notably those that perform co-allocation (i.e., that spread across 
multiple machine schedulers). This has prompted the metacomputing community 
to ask for the enhancement of supercomputer schedulers by the introduction of 
reservations [29] or guaranteed computing power [30,52]. Reservations consist of a 
guarantee that a certain amount of resources is going to be available continuously 
starting at a pre-determined future time. Computing power guarantees consist 
of guarantees that a certain amount of computing power will be available over 
time, e.g. 25% of the time on 16 processors. However, there is still the question 
of how the meta-scheduler decides what is the right reservation to ask for. The 
very first efforts towards answering this question are now under way [10]. 

3.2 Components of a Benchmark Suite 

One of the challenges in building a benchmark suite is determining the appli- 
cation space to be covered, and assembling a set of applications which cover 
the space (the analog of a basis set in linear algebra). The obstacle to doing 
this is that we lack two fundamental pieces of information: what a real metasys- 
tem workload looks like, and what the appropriate axes of the application space 
should be. While we have experience running one or two applications simultane- 
ously, we do not have experience running truly large-scale systems (thousands 
to millions of nodes with hundreds to thousands of simultaneous users). We are 
therefore required to take an evolutionary approach. We will build a benchmark 
suite based on the “tools at hand”, and will refine it over time as we learn more
about metasystem computation. 

A good first step will be to use accepted practice and generate 
micro-benchmarks: individual programs which stress one particular aspect of 
the system. For example, we can create a compute-intensive meta-application 
that can use all the cycles from all the machines it can get, a
communication-intensive meta-application that requires extensive data transfers between its
parts, or a meta-application that requires a specific set of devices from different 
locations. To test metacomputing schedulers, we can generate workloads
consisting of large numbers of applications of a single type, and also mixed-mode
workloads composed of diverse meta-applications. 

As a second step, we can add real-world applications which we already run on 
metasystems. These applications will be components of an overall metasystem 
workload, and can help us to understand the interactions of complex applications 
in a metasystem environment. Using this benchmark suite, we can attempt to 
determine how well particular schedulers work, both alone and in competition. 



3.3 Logging Scheduling Events in a Metacomputer 

The two traditional methods of analyzing the performance of scheduling algo- 
rithms are to simulate synthetic workloads or simulate trace data recorded from 
parallel computers. Even though synthetic workloads do not explicitly require 
trace data, a synthetic workload that is useful must approximate actual work- 
loads and therefore the characteristics of actual workloads must be known. 

It is very difficult to collect data to form a workload of the events that occur 
in a metasystem. The problems are the distributed ownership of the constituents 
of the metasystem, the many points of access to it, and its sheer size. First, the 
metasystem consists of a diverse set of resources owned by dozens of organiza- 
tions. These organizations are fully autonomous and cannot be forced to record 
the events on their local resources and provide them for a metasystem work- 
load. Also, collecting events in a large distributed system is not a trivial task. 
Clock synchronization and causal order techniques can help, but the size and 
geographic dispersion of the metasystem make it a hard problem. Second, each
user may have their own application scheduler and thus there may be a large 
number of different application schedulers. We cannot force these schedulers to 
record events or to provide these events for a metasystem workload. Third, even 
if we could record all of these events and form them into a workload, the system 
would probably be too large to simulate conveniently. 

There are some steps we can take toward recording a metasystem workload. 
First, events can be recorded on a subset of the metasystem. Small sets of sites 
tend to be closely aligned with each other and willing to share data with each 
other. One problem with this technique is that the resources used by users may 
not lie entirely within or without the subset we are recording. If programs use 
resources from across a sub-system boundary, important application information 
will not be recorded. Second, machine scheduling systems typically already have 
recording mechanisms to record events. Third, current metacomputing software
systems [27,33] each provide a common interface to machine schedulers, and events
can be recorded in this interface. Such trace data may provide enough data to
extract information on which requests are co-allocation requests and are part of 
the same application. Note, however, that recording metacomputing applications 
alone would miss applications submitted directly to the local scheduler. 

3.4 Evaluating Metacomputing Scheduling

Another problem we have not yet discussed is how to evaluate the performance
of schedulers in metacomputing environments. First we need to recognize that
there will be many meta schedulers with different goals. Some schedulers will try 
to run applications on single parallel computers as soon as possible, some will 
try to co-allocate resources, others will try to run many serial applications, and 
others will try to have their applications complete as soon as possible by adapting 
to resource availability. The metrics used will vary for each meta scheduler and 
will include metrics such as wait-time, throughput, and turn-around time. 

Even though we cannot record a complete metasystem workload, we can use 
synthetic data to evaluate scheduling algorithms. We have the advantage that 
we may be able to construct a synthetic workload by expanding on trace data 
from part of the metasystem and we can at least use the currently available trace 
data from parallel computers to form synthetic trace data for machine scheduling 
systems. In essence, this means that sampling is used to solve the size problem, as
has also been done with address traces [40]. More research is required to establish 
the methodological basis and limitations of this approach. 



4 Convergence 

4.1 A Comparison 

Scheduling for parallel systems has been studied for a long time, and many 
schemes have been proposed and evaluated [20]. Scheduling in metasystems is 
relatively new, and the evaluation methodology still needs to be developed. A 
relevant question is therefore the degree to which ideas and techniques developed 
for parallel systems can be carried over to metacomputing systems. 

The main difference that is usually mentioned in comparisons of parallel 
systems and metacomputing is that metacomputing deals with heterogeneity, 
whereas parallel systems are homogeneous [5]. This is in fact not so. Hetero- 
geneity comes in three flavors: architectural heterogeneity, where nodes have a 
different architecture, configuration heterogeneity, where nodes are configured 
with different amounts of resources (e.g. different amounts of memory, or differ- 
ent processors from the same family), and load heterogeneity, which means that 
the available resources are different due to current load conditions. While par- 
allel systems usually do not contain architectural heterogeneity, they certainly 
do encounter configuration and load heterogeneities. Therefore their schedulers 
need to deal with nodes that have different amounts of resources available, just
as in metacomputing. They need to make decisions based on estimates of when 
resources will become available, just as in metacomputing. They need to employ 
models of application behavior to estimate how sensitive the application is to 
heterogeneity, just as in metacomputing. They need to deal with requests for 
specific resources (such as extra memory, a certain device, or use of a specific 
license), just as in metacomputing. 




Steve J. Chapin et al. 



The difference between parallel systems and metacomputing is therefore not 
a clear-cut absence of certain problems, but their degree of severity. Some of
the above issues could be ignored by parallel schedulers, at the cost of some 
inefficiency. This has been a common practice, and is one of the reasons for the 
limited utilization observed on many parallel systems. At the present time, these 
issues are beginning to be addressed. This is happening concurrently with the 
emergence of metacomputing, where these issues cannot be ignored, and have to 
be handled from the outset. 

4.2 Integration of Parallel Systems and Metacomputing 

In a metasystem environment, there is interaction between scheduling at the 
local level and scheduling at the meta level. An obvious example is that meta 
schedulers send applications to local schedulers. Another example is that the 
local schedulers can dictate what resources are available to meta applications 
by limiting the number of nodes made available to meta applications or by the 
scheduling policy used when scheduling meta applications versus locally
submitted applications. A third example is that meta applications may ask for
simultaneous access to resources from several local schedulers. This requires local
mechanisms such as reservation of resources and these reservations affect the 
performance of local scheduling algorithms. 

One major question is how much interaction there is, and whether we can evaluate
local and meta schedulers independently, or by using a simple model of the other
type of scheduler. For example, mechanisms for combining queuing scheduling
with reservation in a local scheduler can be evaluated using a synthetic workload 
of reservation requests or a recording of reservation requests. This requires little 
to no knowledge of meta-scheduling algorithms. 

Another example is that meta schedulers can be evaluated using simple mod- 
els of local schedulers if we assume that meta schedulers will not interfere with 
each other. A simple model of a local scheduler would just model the wait time 
of applications submitted to it, the error of wait time predictions, when reserva- 
tions can be made, etc. We can assume meta schedulers will not interfere with 
each other if there are relatively few metasystem users when compared to the 
number of resources available. If meta schedulers can interfere with each other, 
we will have to simulate other meta schedulers using recorded or synthetic data. 
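
The simple local-scheduler model described above can be sketched as follows. This is only an illustration under stated assumptions: the class name `SimpleLocalScheduler`, the exponential wait-time distribution, the bounded relative prediction error, the minimum reservation lead time, and the `pick_machine` policy are all hypothetical choices, not taken from any existing system.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SimpleLocalScheduler:
    """Hypothetical stand-in for a machine scheduler as seen by a meta-scheduler."""
    mean_wait: float             # mean queue wait in seconds (exponential is an assumption)
    prediction_error: float      # relative error bound of predictions, e.g. 0.5 for +/-50%
    min_reservation_lead: float  # a reservation must start at least this far in the future
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def actual_wait(self) -> float:
        """Draw the wait time a submitted application really experiences."""
        return self.rng.expovariate(1.0 / self.mean_wait)

    def predicted_wait(self, actual: float) -> float:
        """Return a prediction that is off by a bounded relative error."""
        factor = 1.0 + self.rng.uniform(-self.prediction_error, self.prediction_error)
        return actual * factor

    def can_reserve(self, now: float, start: float) -> bool:
        """Reservations are accepted only if requested far enough in advance."""
        return start - now >= self.min_reservation_lead


def pick_machine(schedulers: dict) -> tuple:
    """Assumed meta-scheduler policy: use the machine with the smallest predicted wait."""
    outcomes = []
    for name, sched in schedulers.items():
        actual = sched.actual_wait()
        outcomes.append((sched.predicted_wait(actual), actual, name))
    predicted, actual, name = min(outcomes)
    return name, predicted, actual
```

Comparing the returned predicted and actual waits over many calls gives a rough view of how prediction error at the local level degrades meta-level decisions, without simulating the local queue itself.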

We must take care when designing our metrics. In the past, supercomputer 
centers have focused on low-level, system-centric metrics such as percent uti- 
lization. Metaschedulers, on the other hand, are more focused on high-level, 
user-centric metrics such as turnaround time and cost. We believe that these ap- 
parently contradictory metrics can be unified through a proper economic model. 
Utilization metrics are frequently used to justify the past or future purchase of 
a machine (“Look, the machine is busy, it must’ve been worth the money we 
spent!” or “The machine is swamped! We need to buy a new one!”), but in the 
end, all they really tell us is that the machine is busy, not how much effective 
work is being done. With an economic model, the suppliers (supercomputer cen- 
ters, et al.) can control utilization by altering the cost charged per unit time. 
Users can employ personal schedulers to optimize their important criteria. In
the end, this step has to be taken if metasystems are to become a reality, so we 
should make it work for us. 



4.3 An Evaluation Environment 

As noted earlier, it will be nearly impossible to run real benchmark suites across 
large-scale metasystems. Therefore, we opt for simulation to evaluate schedulers. 
A proposed evaluation environment for schedulers is the WARMstones project 
(WARM = Wide-Area Resource Management, and stones is from the traditional 
naming of “stones” for benchmark suites). This is somewhat of a misnomer, 
as WARMstones will encompass a simulation and evaluation environment in 
addition to a benchmark suite, and part of the WARMstones environment will 
simulate and evaluate scheduling for local systems. 

The primary components of WARMstones include a benchmark suite, an 
implementation toolkit for schedulers, a canonical representation of metasys- 
tems, and a simulation engine to evaluate execution of a suite of applications on 
a metasystem using a particular scheduler. As we have already described, the 
benchmark suite will initially comprise combinations of micro-benchmarks and 
existing applications. Rather than executing these applications directly, we will 
represent them using annotated graphs, and simulate the execution by interpret- 
ing the graphs. Legion program graphs [34] are well-suited to this purpose. Users 
will also be able to produce representations of their own applications. 

The implementation toolkit will allow users to implement particular schedul- 
ing algorithms for simulation and evaluation. Again, we draw on earlier expe- 
rience, and plan to use a system much like that in the MESSIAHS distributed 
scheduling system [9].

To evaluate a scheduler, we will first run the scheduler on the benchmark 
suite to produce mappings of programs (graphs) to resources, and then run the 
simulator using the resultant mapping and a system configuration (in canonical 
form) as input. The representation will encapsulate both the local infrastructure 
(workstations, clusters, supercomputers) and the overall structure of the meta- 
system. The system will also employ multiple levels of detail in the simulation. 
For example, depending on how much precision is required and how much time 
and computational resources are available, we could simulate every packet
being transmitted across a network, or we could simply assume a simple model and
estimate the communication time.
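
The two levels of detail mentioned for communication can be contrasted with a small sketch. It assumes a single loss-free link, a latency-plus-bandwidth cost model, and a hypothetical fixed per-packet overhead; the function names and the 1500-byte packet size are illustrative only.

```python
import math


def comm_time_simple(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Coarse model: one latency term plus size divided by bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def comm_time_packet_level(message_bytes, latency_s, bandwidth_bytes_per_s,
                           packet_bytes=1500, per_packet_overhead_s=0.0):
    """Detailed model: advance a clock packet by packet over a single link."""
    remaining = message_bytes
    clock = 0.0
    while remaining > 0:
        size = min(packet_bytes, remaining)
        clock += per_packet_overhead_s + size / bandwidth_bytes_per_s
        remaining -= size
    # The last packet still has to propagate across the link.
    return clock + latency_s
```

With zero per-packet overhead the two models agree, so the cheaper one suffices; the packet-level loop only pays off when per-packet effects matter for the question being asked.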

This evaluation system will enable evaluations of multiple scenarios and fac- 
tors, e.g.: 

— I have devised a new scheduling algorithm. I want to evaluate it using the 
benchmark suite and a range of “standard” machine representations, so that 
I can make “apples-to-apples” comparisons to other schedulers. 

— I have an application I want to run, and I know the target system envi- 
ronment. I can use the evaluation system to help me select among several 
candidate scheduling algorithms. 
— I want to enable run-time selection of “good” scheduling algorithms. I can
make off-line runs iterating across the benchmark suite, the set of available 
schedulers, and a number of “standard” system configurations. I can store 
these results in a table, and at run time I can look up the closest matches 
on application structure and system configuration to find a scheduler which 
should work well for me. 

— I have the choice of purchasing machine A or machine B for my system. I can 
generate program graphs for my top five applications and test them using an 
implementation of my current scheduler on system configurations including 
both machine choices. 
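
The run-time selection scenario above (off-line runs stored in a table, then a closest-match lookup) might be sketched as follows. The feature vectors describing application structure and system configuration, the use of turnaround time as the score, and the unnormalized Euclidean distance are all assumptions made for illustration.

```python
import math


def build_table(results):
    """Keep, for each (application, system) pair, the scheduler with the lowest turnaround.

    results: iterable of (app_vec, sys_vec, scheduler_name, turnaround) from off-line runs.
    """
    best = {}
    for app_vec, sys_vec, scheduler, turnaround in results:
        key = (tuple(app_vec), tuple(sys_vec))
        if key not in best or turnaround < best[key][1]:
            best[key] = (scheduler, turnaround)
    return best


def lookup(table, app_vec, sys_vec):
    """Return the scheduler stored for the closest (application, system) entry."""
    query = tuple(app_vec) + tuple(sys_vec)

    def distance(key):
        # Plain Euclidean distance; real features would need normalization.
        return math.dist(key[0] + key[1], query)

    nearest = min(table, key=distance)
    return table[nearest][0]
```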

5 Conclusions 

Standardization and benchmarking are important for progress because without 
them research is harder to perform and results are harder to compare. While 
there is always room for improvements and additions, it is also necessary to
draw the line and decide to standardize now. It seems that we can immediately 
do so for parallel systems, as enough data is available, at the same time leaving 
the door open for changes as more data becomes available in the future. The 
definitive definition and updates will be posted as part of the Parallel Workloads 
Archive [19]. 

Benchmarking for meta-scheduling is harder, because even less data is avail- 
able, and the environment is more complex. It therefore seems that the best 
current course of action is to try and reduce the complexity by partitioning the 
problem into sub-problems, and trying to deal with each one individually. Thus 
application schedulers will be evaluated using simplified models of resource avail- 
ability provided by separate machine schedulers, and machine schedulers will be 
evaluated using rudimentary models of the requests generated by application 
schedulers. As larger implementations materialize and data is accumulated,
integrated evaluations may become possible.

Abstract. Gang scheduling is an effective scheduling policy for
multiprocessing workloads with significant interprocess synchronization and
is in common use in real installations. In this paper we show that signifi- 
cant improvement in the job slowdown metric can be achieved simply by 
allocating a different number of quanta to different rows (control groups) 
depending on the number of processes belonging to jobs in a given row. 
Specifically, we show that allocating the number of quanta in inverse
proportion to the number of processes per job in that row results in
20-50% smaller slowdowns without significantly affecting mean job
response time. Incorporating these suggestions into real schedulers would
require the addition of only a few lines of simple code, hence this work 
should have an immediate practical impact. 



1 Introduction 

Ousterhout proposed the notion of co-scheduling to efficiently support inter- 
process synchronization when multiprogramming a set of parallel jobs on a 
multiprocessor machine [22]. Coscheduling strives to ensure that all processes
belonging to a job are scheduled at the same time. Subsequent work has gener- 
alized and refined the coscheduling (now often called gang scheduling) concept 
[5,6,8,9,11,14,15,19,23,24,25,26]. Gang scheduling schemes are a practical result 
of the multiprocessor scheduling community and have been adapted for inclusion 
in several production systems including the Intel Paragon [4], CM-5 [3], Meiko
CS-2, multiprocessor SGI workstations [1] and the IBM SP2 [17,18].

Work has been done to examine the impact of quantum size allocations on 
mean job response times and slowdowns for different quantum sizes [25,26]. Little 
work has been done on investigating how quantum allocation can be modified 
to improve mean job slowdowns and response times. In this paper we show 
that negatively correlating the number of allocated quanta with the number of 
processes per job can significantly reduce mean job slowdown. Specifically, we 
show that allocating additional quanta to rows (or control groups) containing
jobs with a small number of processes results in the best overall performance 
observed in our experiments. 

The past work most similar to ours is that of Squillante et al. and Wang
et al. [25,26,27]. In [25] the modeling methodology allows for gangs of different
classes to be allocated different quantum sizes, but the results presented do not
investigate this issue; instead, the work assumes all classes have the same
quantum size.

In [27] the authors set quantum sizes per class based on different offered work 
per class. We do not consider this aspect in our studies. Note, the authors state 
that it would be interesting to investigate the effect of allocating multiple quanta 
to short running jobs. We provide experimental results for such considerations 
and generalize them to include simpler gang scheduling algorithms. 

In [26] the authors consider dividing processing power unequally among
classes, via quanta of different sizes, stating as possible motivations the
allocation of larger quanta to classes with higher context switch costs and to
classes demanding shorter response times. In Figure 8 of [26] the authors show how the mean number
of class i jobs in the system decreases as the fraction of processing power (quan- 
tum length relative to other class quantum lengths) allocated to class i jobs is 
increased. The experiment does not show what happens to the overall number 
of jobs in the system as quantum lengths are increased for one particular class. 

Our work differs in that we fully explore the effect of varying quantum al- 
location per control group size (for DHC). We consider how favoring one class 
affects all classes by reporting mean slowdowns and response times. We provide 
numerous simulation results to make a case for proper quantum allocation. 

In addition to considering the DHC algorithm, we consider the effect of dif- 
ferent quantum allocations with simpler gang scheduling schemes such as Matrix 
[22] and LRS [5]. These simpler algorithms are especially relevant since many 
current production level schedulers use variants of these simpler algorithms. Fi- 
nally, our work also provides a more exhaustive comparison of DHC with Matrix 
and LRS by considering more workloads than considered in [5] and comparing 
DHC with optimized versions of LRS and Matrix. 

Aside from extending the understanding of gang scheduling, this work is 
directly applicable to existing systems. The work involved to modify existing 
schedulers to incorporate these findings should be trivial. Thus, for almost no
implementation work, commercial systems with correlated workloads similar to
our synthetic workloads can decrease slowdowns by 10 to 50%. Furthermore,
for some workloads we show that mean job response time can be decreased by 
50% relative to equal allocation per row (or control group). 



2 Scheduling Algorithm Description 



In this section we review the three previously proposed policies considered and 
detail how we allocate quantum sizes. 

2.1 Matrix



Basic Algorithm Description 

This is the scheduling algorithm as defined in the seminal paper by
Ousterhout [22]. A newly arriving job is placed in the first row with a sufficient number
of idle processor slots to accommodate the job. Note, the slots need not be con- 
tiguous within the row. Upon job completion row unification is not performed as 
it was shown to have little effect [22] since alternate selection fills in holes when 
scheduling a row. 

The policy runs jobs by rotating through the rows of the matrix in a round- 
robin fashion. If there are idle slots within a row scheduled for execution, other 
rows are scanned (in round robin fashion from that row) to see if any other job(s) 
can be completely co-scheduled in the hole. We do not allow fragments to be 
run. In most of the algorithm variants below, a row is allocated multiple quanta. 
Alternate selection is performed as soon as a job completes so that processors 
are not left idle during the remaining quanta allocated to the slot. Note however 
that alternate selection is performed only between quanta. 
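
The placement and alternate-selection rules above can be sketched as follows. This is an illustrative reading, not the simulator's code: the matrix is a list of rows with one slot per processor, a new row is appended when no existing row has room (an assumption), jobs are assumed to need no more processes than there are processors, and a job from another row may fill a hole only if every column it occupies is free.

```python
def place_job(matrix, job_id, size, num_procs):
    """Put the job in the first row with at least `size` idle slots (need not be contiguous)."""
    for row in matrix:
        idle = [c for c in range(num_procs) if row[c] is None]
        if len(idle) >= size:
            for c in idle[:size]:
                row[c] = job_id
            return
    # Assumption: open a new row when no existing row can hold the job.
    row = [None] * num_procs
    for c in range(size):
        row[c] = job_id
    matrix.append(row)


def jobs_to_run(matrix, current):
    """Jobs run in one quantum: the scheduled row plus alternate selection from other rows."""
    busy = [slot is not None for slot in matrix[current]]
    selected = {job for job in matrix[current] if job is not None}
    n = len(matrix)
    for step in range(1, n):  # scan the other rows round-robin from the current one
        row = matrix[(current + step) % n]
        columns = {}
        for c, job in enumerate(row):
            if job is not None:
                columns.setdefault(job, []).append(c)
        for job, cols in columns.items():
            if all(not busy[c] for c in cols):  # whole job fits in the hole; no fragments
                selected.add(job)
                for c in cols:
                    busy[c] = True
    return selected
```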

Quantum Allocation 

We consider a family of Matrix algorithms which differ in how many quanta
are allocated to each row. Let $J_i$ be the number of jobs in row $i$. Let $Q_i$ be
the number of quanta allocated to row $i$. Let $S_t$ be the threshold size, dividing
small jobs from large jobs. Let $P$ be the number of processors. Note that for our
experiments, $S_t$ was set to 8 and $P$ was set to 128, but $S_t$ should be larger if $P$
were increased. We define the following policies:

Matrix-EQL: $Q_i = 1 \;\forall i$.

Matrix-S: $Q_i = J_i$.

Matrix-Sj (where j is an integer):

$$Q_i = \begin{cases} j, & \text{if all jobs in row } i \text{ have} \le S_t \text{ processes} \\ 1, & \text{otherwise} \end{cases} \qquad (1)$$

Matrix-Lj (where j is an integer):

$$Q_i = \begin{cases} 1, & \text{if all jobs in row } i \text{ have} \le S_t \text{ processes} \\ j, & \text{otherwise} \end{cases} \qquad (2)$$

Matrix-S and Matrix-Sj favor jobs with a small number of processes. In 
Matrix-S, the number of quanta allocated equals the number of jobs in the row,
hence rows containing many small jobs will receive additional processing power.
In Matrix-Sj, rows in which all jobs have 8 or fewer processes are allocated j
quanta, while a row containing one or more jobs of 9 or more processes is allocated one
quantum. The parameter j allows us to vary how much preference is given to 
jobs with a small number of processes. In our experiments we consider Matrix- 
S2, Matrix-S8, and Matrix-S16. Note, the value of 8 for differentiating between 
large and small jobs was arbitrarily chosen based on intuition for the workload 
descriptions. 

The Matrix-Lj policies are defined in a similar fashion but give preference to
jobs with a large number of processes. Rows with at least one large job are given 
additional quanta. 
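
The four Matrix quantum-allocation policies can be written as one small function. The sketch below follows the prose description (a row is "small" when all of its jobs have 8 or fewer processes); the function name, the policy labels passed as strings, and the representation of a row as a list of job sizes are assumptions made for illustration.

```python
def matrix_quanta(policy, row_job_sizes, j=1, threshold=8):
    """Number of quanta for one row.

    policy: "EQL", "S", "Sj" or "Lj"; row_job_sizes: process counts of the jobs in the row.
    """
    all_small = all(size <= threshold for size in row_job_sizes)
    if policy == "EQL":
        return 1                      # every row gets one quantum
    if policy == "S":
        return len(row_job_sizes)     # one quantum per job in the row
    if policy == "Sj":
        return j if all_small else 1  # favor rows holding only small jobs
    if policy == "Lj":
        return 1 if all_small else j  # favor rows holding at least one large job
    raise ValueError(f"unknown policy: {policy}")
```

For example, under Matrix-S8 a row holding jobs of 8 and 2 processes receives 8 quanta, while a row holding a 9-process job receives one.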



2.2 LRS 

LRS [5] differs from Matrix only in how newly arriving jobs are allocated within 
the matrix. Jobs that have more than 8 processes are allocated processors from 
left to right while the rest are allocated processors from right to left. 
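
A minimal sketch of the LRS placement rule follows, under the same assumptions as for Matrix: rows are lists with one slot per processor, the first row with enough idle slots is used, and a new row is appended when none fits (the last point is an assumption). The function name is hypothetical.

```python
def lrs_place(matrix, job_id, size, num_procs, threshold=8):
    """Large jobs take idle slots from the left, small jobs from the right."""
    if size > threshold:
        order = range(num_procs)              # left to right
    else:
        order = range(num_procs - 1, -1, -1)  # right to left
    for row in matrix:
        idle = [c for c in order if row[c] is None]
        if len(idle) >= size:
            for c in idle[:size]:
                row[c] = job_id
            return row
    # Assumption: open a new row when no existing row can hold the job.
    row = [None] * num_procs
    for c in list(order)[:size]:
        row[c] = job_id
    matrix.append(row)
    return row
```

Packing large and small jobs from opposite ends keeps the idle slots of a row together in the middle, which is what distinguishes LRS from plain Matrix placement.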

Quantum Allocation 

The family of policies, LRS-EQL, LRS-S, LRS-Sj, and LRS-Lj, is defined in
the same way as the family of Matrix algorithms above.

2.3 DHC 

This algorithm is similar to the minimum fragmentation with selective disabling 
version of the DHC algorithm proposed by Feitelson and Rudolph [9]. It is based
on a hierarchy of controllers. Each block of $2^i$ PEs is assigned a controller that
coordinates their activities. Controllers at higher levels of the hierarchy
coordinate those at the lower levels. In addition, the controllers at each level have
lateral connections among them. An arriving job is mapped to a controller that
has just enough processors under it to satisfy the job demand. The job is mapped
to the controller that has the least load among the controllers at that level. If the
controller controls more than one processor, it partitions the threads of the job
as follows: Let $2^{i-1} < t \le 2^i$ be the total number of threads of the job, $l_1$ the
load on the left child and $l_2$ the load on the right child. Assuming w.l.o.g. that
$l_1 < l_2$, the first $2^{i-1}$ threads are mapped to processors controlled by the left
child and the rest are mapped to processors controlled by the right child. The
scheduling is done in rounds. In each round, jobs mapped to controllers at level 
i are scheduled ahead of those mapped to controllers at lower levels. Alternate 
selection is done as discussed for the previous algorithms. 

Quantum Allocation 

We consider a family of DHC algorithms which differ in how many quanta
are allocated to each control group. Consider control group $i$, such that $2^i$ is
the largest number of processes a job belonging to control group $i$ may have.
Jobs in each control group are allocated $Q_i$ quanta. Furthermore, let total equal
the total number of processes in the system and let $n_{im}$ denote the number of
processes of job $m$ in the control group. We define the following policies:

DHC-EQL: $Q_i = 1 \;\forall i$.

DHC-Sj (for integer j):

$$Q_i = \begin{cases} (k - i + 1)\,j, & \forall i,\ j = 1 \\ \max\{1, (k - i)\,j\}, & \forall i,\ j > 1 \end{cases} \qquad (3)$$

where $k$ is such that $2^k$ is the number of processors.

DHC-Lj (for integer j):

$$Q_i = \begin{cases} i + 1, & \forall i,\ j = 1 \\ \max\{1, i \cdot j\}, & \forall i,\ j > 1 \end{cases} \qquad (4)$$

DHC-original: Q, = 

Note that this is equivalent to the original definition: Qi = ^ , where tm 

is the number of processes of job m. 

Thus, DHC-Sj allocates control groups with small jobs a larger number of 
quanta and the magnitude of the difference is determined by parameter j. Con- 
versely, DHC-Lj allocates control groups with large jobs a larger number of 
quanta. DHC-original is the original quantum size allocation algorithm specified
in [9]. Feitelson and Rudolph state that preference should be given to jobs with a
large number of processes. Note, this is contrary to the findings of this paper. 
Regardless, we find that the DHC-original policy approximates DHC-EQL at 
higher loads. 
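
The DHC-EQL, DHC-Sj and DHC-Lj allocations can be sketched as a single function. It encodes our reading of Eqs. (3) and (4): for DHC-Sj, $Q_i = (k - i + 1)j$ when $j = 1$ and $\max\{1, (k - i)j\}$ when $j > 1$; for DHC-Lj, $Q_i = i + 1$ when $j = 1$ and $\max\{1, i \cdot j\}$ when $j > 1$, with $2^k$ processors and control group $i$ holding jobs of at most $2^i$ processes. The function name and string policy labels are hypothetical, and DHC-original is deliberately left out.

```python
def dhc_quanta(policy, i, j, k):
    """Quanta for control group i (jobs of at most 2**i processes) on 2**k processors."""
    if policy == "EQL":
        return 1
    if policy == "Sj":  # more quanta for control groups with small jobs
        return (k - i + 1) * j if j == 1 else max(1, (k - i) * j)
    if policy == "Lj":  # more quanta for control groups with large jobs
        return i + 1 if j == 1 else max(1, i * j)
    raise ValueError(f"unknown policy: {policy}")
```

On a 128-processor machine (k = 7), DHC-S1 under this reading gives 8 quanta to the group of single-process jobs and 1 quantum to the group of 128-process jobs; DHC-L1 reverses the order.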



3 Simulation Methodology, Workload Models, and Metric 

We have simulated a 128 processor machine using the Simpack [12,13] simulation 
package. Our simulator models job arrivals with exponentially distributed
interarrival times. Upon job arrival the number of processes and total job demand are
determined as described below. We do not simulate interprocess synchronization 
or context switch overheads. All three policies considered have been shown to 
support interprocess synchronization well and hence such details would distract 
from the emphasis of this paper. Tuning gang scheduling algorithms for context 
switch overheads has been considered in [26] and inclusion here would obscure 
the main focus of the paper. 

Confidence intervals were collected using batch means. We ran 60 batches of 
500 jobs per batch resulting in at most 10% response time confidence intervals 
at a 95% confidence level. The first 500 jobs simulated were discarded to achieve 
a warm start. 
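
The batch-means procedure described above can be sketched as follows. The defaults mirror the stated setup (60 batches of 500 jobs after discarding the first 500); the function name is hypothetical, and the Student-t critical value of about 2.001 for 59 degrees of freedom at the 95% level is supplied as a parameter rather than computed.

```python
import math
import statistics


def batch_means_ci(samples, num_batches=60, batch_size=500, warmup=500, t_crit=2.001):
    """Return (grand mean, confidence-interval half-width) using the batch-means method."""
    data = samples[warmup:warmup + num_batches * batch_size]
    if len(data) < num_batches * batch_size:
        raise ValueError("not enough samples for the requested batches")
    means = [
        statistics.fmean(data[b * batch_size:(b + 1) * batch_size])
        for b in range(num_batches)
    ]
    grand_mean = statistics.fmean(means)
    # Batch means are treated as approximately independent observations.
    half_width = t_crit * statistics.stdev(means) / math.sqrt(num_batches)
    return grand_mean, half_width
```

A run meets the accuracy target quoted above when the half-width divided by the grand mean is at most 0.10.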

3.1 Definitions and Metrics 

We define a job to be composed of one or more processes. We will refer to jobs 
with a small number of processes as “small” jobs vs “large” jobs, and jobs with
short execution times as “short” jobs vs “long” jobs. Job response time is defined
as the difference between job completion and arrival times. The slowdown of a
job is defined as the ratio of the response time of the job to the execution
time of the job if run in isolation.

We collected both mean job response times and mean slowdowns. We focus 
primarily on mean slowdowns. The mean job response time metric is dominated 
by the long jobs thus obscuring the impact of the scheduling algorithm on short 
jobs. Conversely, the slowdown metric emphasizes the penalty paid for slowing 
down short jobs. Since users often are actively waiting for short jobs to complete,
it makes sense to focus on the slowdown metric as the primary comparison
metric. We note that there is not yet agreement on the most relevant comparison
metric [7], hence we consider both slowdown and response time.
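
A small worked example shows why the two metrics emphasize different jobs. The job tuples below are made up for illustration: a short job and a long job are each delayed by 90 time units.

```python
def response_time(arrival, completion):
    """Time from job arrival to job completion."""
    return completion - arrival


def slowdown(arrival, completion, isolated_time):
    """Response time divided by the execution time of the job when run in isolation."""
    return response_time(arrival, completion) / isolated_time


# (arrival, completion, execution time in isolation); both jobs are delayed by 90 units.
jobs = [(0.0, 100.0, 10.0), (0.0, 1090.0, 1000.0)]
mean_response = sum(response_time(a, c) for a, c, _ in jobs) / len(jobs)
mean_slowdown = sum(slowdown(a, c, t) for a, c, t in jobs) / len(jobs)
```

Here the mean response time (595) is set almost entirely by the long job, while the mean slowdown (about 5.5) is set almost entirely by the short job, whose slowdown is 10 compared with 1.09 for the long one.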

3.2 Workload Models 

We considered 4 different workloads. The first three are minor variants of the 
workloads proposed by Leutenegger and Vernon [20] and subsequently used in 
other works [2,21]. The workloads differ in the degree of correlation between 
the number of processes and total job demand. The fourth workload is that of 
Feitelson [5]. 



Fig. 1. PMF of workloads. Top left: X = 0.05; top right: X = 0.1; bottom
left: X = 0.25; bottom right: Feitelson workload.



Geometric-Bounded $N$, $N^{1.5}$, and $N^2$ Workloads

In these workloads we first determine the number of processes, n, for a job 
and then the job demand. As hypothesized in [20], and seen in many measured 
workloads on recent systems [10,16], a significant portion of jobs have parallelism 
equal to the number of processors. In addition, a significant portion of jobs have 
parallelism equal to half the number of processors (assuming the number of 
processors is a power of two). Accordingly, we assume the number of processors
is a power of 2 and let X be the percentage of jobs whose parallelism is equal
to $2^k$ and $2^{k-1}$, where $2^k$ is equal to the number of processors. For example, for
X = 10, 10% of the jobs have the number of processes set to 128 and 10% of the
jobs have the number of processes set to 64. The remaining 100 - (2*X) percent
of the jobs have the number of processes drawn from a geometric distribution
with mean $\bar{n}$. In all of our studies we set $\bar{n} = 4$. If the sample drawn from the
distribution exceeds the number of processors we truncate the sample to equal
the number of processors. We consider values of X = 5%, 10%, and 25%. Note,
recent measurement studies have also indicated more probability mass at 16 
and 32 processes. Rather than try to mimic a specific distribution we choose 
to capture the spirit of the distribution. To address this concern we also study 
the workload of Feitelson, described below, which includes more of these smaller 
jobs. 
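
The procedure for choosing the number of processes can be sketched as follows. This is a minimal illustration rather than the simulator's code: it assumes 128 processors (as in the example above), a geometric distribution on {1, 2, ...} with success probability 1/mean, and a hypothetical function name.

```python
import math
import random


def sample_num_processes(rng, x_percent, num_procs=128, mean=4.0):
    """Draw the number of processes for one job.

    x_percent of jobs get num_procs processes, another x_percent get
    num_procs // 2, and the rest follow a truncated geometric distribution.
    """
    u = rng.random() * 100.0
    if u < x_percent:
        return num_procs
    if u < 2 * x_percent:
        return num_procs // 2
    # Inverse-transform sampling of a geometric distribution on {1, 2, ...}
    # with success probability p = 1 / mean (an assumed parameterization).
    p = 1.0 / mean
    v = 1.0 - rng.random()  # uniform on (0, 1], so log(v) is defined
    k = int(math.floor(math.log(v) / math.log(1.0 - p))) + 1
    # Truncate samples that exceed the number of processors.
    return min(k, num_procs)


rng = random.Random(1)
samples = [sample_num_processes(rng, x_percent=10) for _ in range(20000)]
```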

In Figure 1 we plot the probability mass function of the number of processes 
per job for each of the workloads considered in this paper. Note that when X is 
increased the spikes at 64 and 128 processes per job increase accordingly. 

Once the number of processes, n, is obtained, the job demand is determined. 
In all cases we study we assume job demand is positively correlated with the 
number of processes. The job demand is drawn from a two-stage hyper-exponential 
distribution with mean D, where D is determined as described below. The 
coefficient of variation of this distribution is set equal to 2 in all of our reported 
experiments. The overall coefficient of variation is much greater than 2, due to 
the linear/superlinear dependence of job demand on processors [20]. For example, 
the coefficient of variation of total job demand for the N^2 workload is (8.0, 
5.7, 3.6) for the workloads with X equal to (0.05, 0.1, 0.25). 

We consider three cases: 

N:      D = n * d 
N^1.5:  D = n^1.5 * d 
N^2:    D = n^2 * d 

where d is a constant and n is the number of processes. Parameter d is set to 10 
in all of our experiments. 
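
A sketch of how a job demand could be drawn under these definitions is given below. The text fixes only the mean D and the coefficient of variation of the two-stage hyper-exponential distribution; the "balanced means" parameterization used here is our assumption for illustration, as are the function names.

```python
import math
import random


def mean_demand(n, exponent, d=10.0):
    """Mean demand D = n**exponent * d, with exponent 1, 1.5 or 2."""
    return (n ** exponent) * d


def h2_params(mean, cv):
    """Two-stage hyper-exponential with the given mean and CV (cv >= 1).

    Uses the balanced-means parameterization (an assumption): each branch
    contributes half of the mean. Returns (p1, rate1, rate2).
    """
    c2 = cv * cv
    p1 = 0.5 * (1.0 + math.sqrt((c2 - 1.0) / (c2 + 1.0)))
    p2 = 1.0 - p1
    return p1, 2.0 * p1 / mean, 2.0 * p2 / mean


def sample_demand(rng, n, exponent, d=10.0, cv=2.0):
    """Draw one job demand for a job with n processes."""
    p1, rate1, rate2 = h2_params(mean_demand(n, exponent, d), cv)
    rate = rate1 if rng.random() < p1 else rate2
    return rng.expovariate(rate)
```

For example, `mean_demand(4, 2)` gives a mean demand of 160 for a 4-process job in the N^2 workload with d = 10.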

Feitelson Workload 

This is the workload proposed in [5]. The workload is based on observations 
from 6 production parallel machines. It contains a significant portion of jobs 
that have a parallelism that is a power of two. The correlation between number 
of processes and total job demand is between N and N^2. The workload also 
includes repeated executions of certain jobs. Arrivals are distributed according 
to a Poisson process. A more detailed description and the code for generating the 
workload are obtainable from 
www.cs.huji.ac.il/labs/parallel/workload/wlmodels.html. 

4 Results 

In this section we present our simulation results. In all experiments presented, 
bars for specific algorithms presented left to right correspond to entries in the 
legend from top to bottom. 

Fig. 2. Slowdowns, N^2 X = 10 Workload: Top Left: LRS, Top Right: DHC, 
Bottom Left: Matrix 



4.1 Geometric-Bounded N^2 Workload 

In Figure 2 we plot the slowdown of the variants of the LRS, DHC and Matrix 
algorithms for the N^2 workload with X, the percentage of jobs whose number 
of processes is set to 64 and 128, equal to 10. In each figure the left set of bars 
is for a utilization of 90% and the right set of bars is for a utilization of 70%. 
In each case we can observe that the variants that give preference to small jobs 
perform better than those that give preference to large jobs. However, giving too 
many quanta to small jobs is counterproductive, as exhibited by the alg-S8 and 
alg-S16 (where alg = LRS, Matrix, or DHC) policies. In general, variations that 
give extreme preference to large/small jobs perform poorly. In addition, equal 
allocation per row (or control group for DHC) results in larger slowdowns than 
alg-S. As shown in Figures 3, 4 and 5, the slowdowns of LRS-EQL, DHC-EQL, 
and Matrix-EQL are 36%, 14% and 39% higher than the slowdowns of LRS-S, 
DHC-S1, and Matrix-S at a 90% utilization and 25%, 39% and 25% higher at a 
70% utilization. 

Fig. 3. DHC Slowdown Ratios, N^2 X = 10 Workload: Left: 70% Utilization, 
Right: 90% Utilization 



To understand why alg-S performs better than alg-EQL, consider the case 
when multiple small jobs are packed per row (or control groups of the same level) 
whereas only one or two large jobs are packed per row. If each row is given equal 
allocation of processing power then the jobs with a large number of processes 
receive a disproportionately larger share of processing power relative to small 
jobs. As shown in [20], this is counterproductive for workloads where the job 
demand is correlated with the number of processes. In [20] it was suggested that 
allocating equal processing power per job is beneficial in the absence of more 
detailed job knowledge. The LRS-S, Matrix-S, and DHC-S1 algorithms attempt 
to allocate equal processing power per job but are not always able to, given the rigid 
packing constraints of the algorithms. For example, if in Matrix there are two 
rows, one with a job of 128 processes and one with 4 jobs of 32 processes each, 
then the first row is allocated 1 quantum while the second is allocated 4 quanta. 
In this case, each job is allocated equal processing power. On the other hand, if 
the second row has three jobs with 64, 32, and 32 processes each, the second row 
is allocated 3 quanta. In this case the job with 64 processes is allocated more 
processing power than the other jobs. Hence, truly equal allocation per job is 
not possible given the packing constraints assumed. 

Fig. 5. Matrix Slowdown Ratios, N^2 X = 10 Workload: Left: 70% Utilization, 
Right: 90% Utilization 
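
The two Matrix examples above can be reproduced with a short sketch. It measures a job's processing power per scheduling cycle as its number of processes times the quanta given to its row, which is our reading of the examples; the function name and the policy labels are hypothetical.

```python
def processing_power(rows, policy):
    """Per-job processing power (processes * quanta) for one cycle.

    rows:   list of rows, each a list of per-job process counts.
    policy: "EQL" gives every row one quantum; "S" gives each row
            as many quanta as it has jobs.
    """
    result = []
    for row in rows:
        quanta = len(row) if policy == "S" else 1
        result.append([n * quanta for n in row])
    return result


# One job of 128 processes versus four jobs of 32 processes each.
print(processing_power([[128], [32, 32, 32, 32]], "S"))
# One job of 128 processes versus jobs of 64, 32 and 32 processes.
print(processing_power([[128], [64, 32, 32]], "S"))
```

In the first case every job receives the same amount under the S policy; in the second the 64-process job receives more than any other job, which is the packing limitation described above.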

Consider the performance of the DHC-Original quantum allocation policy in 
Figure 2. DHC-Original performs worse than DHC-L1, DHC-L2 and DHC-L4 at 
70% utilization, but better at 90% utilization. This is because DHC-Original has 
less disparity in quantum allocations (to jobs with varying degrees of parallelism) 
at higher utilizations. For example, consider the following two scenarios: 

- Assume there are only two jobs in the system, j1 with 128 threads and j2 
with 1 thread. In this case, j1 gets a quantum size of 128/129 * c while j2 
gets a quantum size of 1/129 * c. 

- Assume the following jobs in the system: j1 with 128 threads, j2 with 64 
threads, j3 with 32 threads, and j4 and j5 with 16 threads each. In this case 
j1 gets a quantum size of 128/256 * c and jobs j2 through j5 also get an 
effective quantum size of 128/256 * c due to alternate scheduling. 

Thus, the originally proposed DHC quantum size allocation algorithm approaches 
DHC-EQL at high loads, but is more similar to DHC-Lx (for some x) at lower 
loads. 
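
The arithmetic of the two scenarios can be checked with the sketch below. It is a simplification, not the DHC algorithm itself: it assumes that jobs scheduled as alternates of one another share a slot, and that each slot receives a fraction of c proportional to its total thread count. The function name is hypothetical.

```python
from fractions import Fraction


def quantum_fractions(slots):
    """Fraction of the base quantum c received by each slot.

    slots: list of slots, each a list of thread counts of the jobs
           that are scheduled together as alternates.
    """
    total = sum(sum(slot) for slot in slots)
    return [Fraction(sum(slot), total) for slot in slots]


# Scenario 1: j1 (128 threads) and j2 (1 thread).
print(quantum_fractions([[128], [1]]))
# Scenario 2: j1 (128 threads); j2..j5 (64, 32, 16, 16 threads) as alternates.
print(quantum_fractions([[128], [64, 32, 16, 16]]))
```

With few jobs the allocation is very uneven, while with enough small jobs to fill the machine both slots receive half of c, which matches the behaviour of DHC-EQL.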

To make the relative slowdowns clearer, we plot the slowdown ratios 
for the DHC algorithm in Figure 3. We plot the slowdown ratio of each DHC 
variant relative to DHC-S1. The three groups of bars from left to right are for 
X = 5, 10 and 25%. The extreme variants, viz. DHC-L8, DHC-L4 and DHC-S8, 
are not shown since they compress the scale of the figure. DHC-EQL results in 
slowdowns that are (31, 39, 24)% larger than DHC-S1 for X = (5, 10, 25%) at 
a 70% utilization, and (17, 14, 13)% larger at a 90% utilization. Overall, the 
performance of DHC-S1 is comparable to or better than that of all other variants. 
In some cases DHC-S2 performs slightly better. 

For the LRS and Matrix algorithms the difference between alg-EQL and alg-S 
is even greater (at a 90% utilization) than for DHC, as can be seen in Figures 4 
and 5. LRS-EQL results in slowdowns that are (29, 25, 13)% larger than LRS-S 
for X = (5, 10, 25%) at a 70% utilization, and (67, 36, 28)% larger at a 90% 
utilization. Matrix-EQL results in slowdowns that are (29, 25, 14)% larger than 
Matrix-S for X = (5, 10, 25%) at a 70% utilization, and (90, 39, 25)% larger 
at a 90% utilization. Thus, even bigger improvements can be expected when 
modifying an existing scheduling algorithm based on Matrix or LRS to allocate 
the number of quanta proportional to the number of jobs in the row rather than 
equally per row. 

4.2 Geometric-Bounded N^1.5 Workload 

The qualitative results obtained for this workload were similar (within 10%) to 
those obtained for the Geometric-Bounded N^2 workload. Experimental results 
are not included for purposes of brevity. 

Fig. 6. Slowdowns, N X = 10 Workload, Utilization = 70% 



4.3 Geometric-Bounded N Workload 

In Figure 6 we plot the slowdowns for the N correlated workload with X = 10 
at 70% utilization. The left set of bars is for the LRS variants, the middle set 
for the DHC variants and the set on the right is for the Matrix variants. Unlike 
in the previous cases, variants that give equal preference to all jobs have better 
performance compared to variants that give preference to large/small jobs. This 
behavior occurs for this workload since job execution time (if run alone) is not 
correlated with job parallelism. Thus, giving preference to small jobs is 
counterproductive. 
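
A short calculation makes this explicit. It assumes, as a simplification introduced here, that a job run alone keeps all n of its processes busy, so that its mean execution time T is its mean demand D divided by n; the exponent a is our shorthand for the workload (a = 1, 1.5 or 2).

```latex
T = \frac{D}{n} = \frac{n^{a}\, d}{n} = n^{a-1}\, d,
\qquad
T =
\begin{cases}
d           & a = 1   \quad (N \text{ workload}) \\
\sqrt{n}\,d & a = 1.5 \quad (N^{1.5} \text{ workload}) \\
n\,d        & a = 2   \quad (N^{2} \text{ workload})
\end{cases}
```

Under this assumption the mean execution time in the N workload is the same constant d for every degree of parallelism, whereas in the other two workloads jobs with fewer processes are also the shorter jobs.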

4.4 Feitelson96 Workload 

In Figure 7 we plot the slowdowns of variants of the LRS, DHC and Matrix 
algorithms respectively for the Feitelson96 workload at 70% and 80% utilization. 
For this workload too, giving preference to small jobs results in better slowdowns. 
This is especially true at an 80% utilization, where the slowdowns of LRS-EQL 
and Matrix-EQL are 192% and 188% higher than those of LRS-S and Matrix-S. 
Note that for this workload there is little difference between DHC-EQL and DHC-S1. 



4.5 Response Times 

In Figure 8 we plot the mean response times for the N^2 and N workloads 
assuming X = 10. For the Matrix and LRS policies there is only a minor difference 
(up to about 20%) between the mean response times of the variants that give 
preference to large jobs and those that give preference to small jobs. Conversely, 
DHC exhibits a serious degradation in mean job response time going from 
DHC-EQL to DHC-S8. We conjecture that this is because giving multiple quanta to 
the smallest control groups results in idling of processors, since larger alternates 
cannot fill the holes left when scheduling the small control groups. The problem 
is magnified at higher utilizations and also for the N workload. 

In Figure 9 we plot the mean response times of the various algorithms (and 
variants) for the Feitelson96 workload at 70% and 80% utilization. For the 
Feitelson96 workload, giving preference to small jobs significantly improves 
the mean response time (MRT) for the LRS and Matrix algorithms at higher 
utilization. The MRT of Matrix-EQL (LRS-EQL) is 75% (79%) larger than the MRT 
of Matrix-S (LRS-S). Again, the MRT of DHC is significantly increased by giving 
additional quanta to small control groups. 

Based on the mean response times of DHC shown in Figures 8 and 9, we 
suggest that DHC-EQL be used instead of DHC-S1. Another approach would be 
to detect when the smaller control groups do not fill the processors and allocate 
fewer quanta. 



4.6 Relative Performance of Algorithms 

In this section we compare the relative performance of the 3 algorithms for 
the N^2 workload. A comparison of LRS-EQL, Matrix-EQL and DHC-Original 
for the Feitelson96 workload can be found in [5], but that comparison did not 
consider the LRS-S and Matrix-S variants proposed in this paper. We plot the 
slowdowns and mean response times of the EQL and S variants (S1 in case of 
DHC) for the Geometric-Bounded N^2 workload in Figure 10. This workload 
was not considered in previous papers on DHC. We choose only the EQL and S 
variants since their performance is comparable to or better than that of the other 
variants of each algorithm. 

DHC-S1 has the best slowdowns among these variants at both 70% and 
90% utilization, but larger mean response times. LRS performs better than 
Matrix both in terms of slowdowns and mean response times. DHC-EQL performs 
marginally better than LRS-S. Thus, for this workload the DHC-EQL and LRS-S 
policies are the best and about equal. 

In Figure 11 we plot the slowdowns and mean response times of the EQL 
and S variants (S1 for DHC) for the Feitelson96 workload. Again, for this 
workload, the performance of DHC-EQL is comparable to that of LRS-S in terms of 
slowdown and mean response times. 

Fig. 9. Mean Response Times, Feitelson96 Workload, Left: 70% utilization. 
Right: 90% utilization 

5 Conclusions 

In this paper we showed via a simulation study that for LRS and Matrix 
allocating quanta inversely proportional to mean job parallelism can significantly 
decrease mean job slowdowns with little impact on mean job response time. Note that 
the amount of work to modify existing production gang scheduling policies to 
incorporate these findings would be trivial and should result in a significant 
performance gain. 

We considered three gang scheduling policies: Matrix, LRS, and DHC. We 
ran simulations on four different workloads, varying numerous job parallelism 
parameters. Total job demand is superlinearly correlated with the number of 
processes for most of the workloads considered. 

In general, we show that mean job slowdown can be decreased by 20 to 50% 
for the Matrix and LRS algorithms simply by letting the number of quanta 
allocated to a row equal the number of jobs in the row rather than using equal 
allocation per row. Equal allocation per row gives a disproportionate share of 
processing power to jobs with a large number of processes. These large jobs also 
have longer execution times, hence this is a poor allocation choice. By allocating 
additional quanta to rows containing jobs with less parallelism we do a better 
job of allocating equal processing power per job, resulting in reduced mean job 
slowdowns. 

For the DHC policy, we have shown that allocation of additional quanta 
to control groups containing jobs with a small number of processes sometimes 
decreases mean job slowdown compared to equal allocation per control group, but 
at the expense of significantly larger mean job response times. Furthermore, the 
quantum size allocation policy originally proposed performs 
similarly to equal allocation per control group. 

Finally, we compared the performance of LRS-S, Matrix-S, and DHC-EQL. 
In earlier work LRS-EQL, which is worse than LRS-S, was shown to result in 
larger slowdowns than DHC. We show that LRS-S and DHC-EQL have similar 
performance across all studies considered. The DHC algorithm has the advantage 
of being less centralized than LRS, making it more attractive in a distributed or 
massively parallel setting, but the LRS-S algorithm is considerably simpler to 
implement and achieves comparable performance. 

Abstract. This paper presents a parallel process scheduling method for 
the AP/Linux parallel operating system. This method relies on two levels of 
scheduling: local scheduling on each processor, and global scheduling, which is 
called moderate co-scheduling. Moderate co-scheduling schedules parallel 
processes simultaneously on each processor by controlling the priorities of 
parallel processes. This method differs from gang scheduling in that it 
does not promise the running of a parallel process on all processors at 
the same time. Moderate co-scheduling only suggests a suitable current 
process to the local scheduling. However, this is a good solution for both fine 
and coarse grain parallel processes, because moderate co-scheduling provides 
the timing for scheduling fine grain parallel processes simultaneously 
(tightly-coupled processes on each processor, which require quick and 
frequent communication), and local scheduling can yield CPU time when 
coarse grain parallel processes (loosely-coupled processes on each processor, 
which cause long waits and less frequent communication) must wait 
for a long time. The method is implemented using special AP1000+ hardware. 
We call the implementation “Internal synchronization,” which uses 
the synchronized clock. The co-scheduling skew of the implementation 
was about 2% when the period of moderate co-scheduling was 200 ms. 



1 Introduction 

This paper describes the implementation and performance of scheduling for efficient 
execution of parallel processes on a parallel OS called AP/Linux [1], which is 
designed for AP1000+ parallel computers. Research on parallel process scheduling 
has been widely conducted with the aims of operating parallel computers efficiently 
and providing responsive service to individual users. One approach is 
“space sharing” scheduling, with which several parallel processes are laid out 
efficiently in the space of processors. Space sharing scheduling has been studied 
actively since the beginning of the 1990’s, in line with research on region 
management (partitioning algorithms) suited to the respective network topologies. 
Another approach is “time sharing” scheduling, with which responsiveness is 
improved by preempting parallel processes. A third approach that has been 
proposed is combined scheduling, with which space sharing and time sharing are 
combined to make the most of their respective advantages. AP/Linux adopts 
combined scheduling because of its high efficiency. 

When preempting a parallel process by time sharing scheduling, the 
communication properties of the parallel program to be executed must be taken into 
account. In the case of fine grain parallel processes that conduct small-scale 
communications frequently, it is considered more efficient to schedule the parallel 
processes on the respective processors simultaneously (co-scheduling [2]) 
and wait for messages by “busy wait.” On the other hand, in the case of coarse 
grain parallel processes that conduct large-scale communications less frequently, 
it is considered better to improve the throughput by context switching, since it 
takes a long time to wait for messages by “busy wait.” There is a communication 
method that deals with both fine grain and coarse grain parallel processes 
appropriately, with which context switching is conducted after waiting for 
messages several times by “busy wait” [3]. The communication library of AP/Linux 
is equipped with this method, and its effect has already been confirmed [4]. Even 
if this method is adopted, however, it is still important in the case of fine grain 
parallel processes to conduct co-scheduling so that the “busy wait” is used effectively. 

A method for co-scheduling all parallel processes simultaneously is gang 
scheduling [5], with which a certain CPU time is allocated to parallel processes 
by starting them simultaneously, and execution of other processes is prohibited 
to guarantee busy wait communications. However, since strict gang scheduling 
blocks other processes, it is not possible to switch over to another process even 
when daemon processing is urgently required or when blocking occurs during I/O 
processing on one of the processors. In addition, when context switching 
parallel processes, gang scheduling may require a save/restore mechanism to 
avoid collisions and misdelivery of in-flight messages on the network. 

In this paper, the authors propose a method for conducting co-scheduling 
that relaxes the strict conditions of gang scheduling by controlling the 
priority of the parallel processes managed by the local scheduler on the respective 
processors. Although this method gives priority to one selected 
parallel process, it does not impede other processes that need 
to be processed urgently, such as daemons. In addition, when the 
target parallel process is not executable due to I/O waiting or other reasons, 
another parallel process can be processed. By actually implementing this scheduling 
method, the authors show its superiority in terms of overall efficiency, despite 
a slight penalty in waiting for communication. In contrast to strict gang scheduling, 
this method is called moderate co-scheduling. 

The central issue in implementation is the mechanism for changing the priority 
of parallel processes simultaneously. AP/Linux is equipped with internal 
synchronization, which uses the synchronized clock of the respective processing 
elements, taking advantage of the hardware properties of the AP1000+. With 
internal synchronization, the time allotted to the parallel process that currently 
holds priority is checked every time the local scheduler reschedules, and priority 
is given to another parallel process when the allotted time has expired. The 
performance and problems of this method as implemented are discussed later. 
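
One way a synchronized clock allows every cell to reach the same decision without exchanging messages is sketched below. This is an illustrative assumption, not the AP/Linux implementation: the rotation rule, the function name, and the use of sorted process IDs as a common ordering are all hypothetical, and the 200 ms period is the value quoted in the abstract.

```python
def priority_holder(clock_ms, parallel_pids, period_ms=200):
    """Return the parallel process that should hold priority right now.

    Every cell evaluates this with the same synchronized clock value and
    the same set of process IDs, so every cell selects the same process.
    """
    if not parallel_pids:
        return None
    ordered = sorted(parallel_pids)  # common ordering on all cells (assumed)
    slot = (clock_ms // period_ms) % len(ordered)
    return ordered[slot]


# Three hypothetical parallel processes sharing the machine.
pids = [20002, 20001, 20003]
for t in (0, 199, 200, 400, 600):
    print(t, priority_holder(t, pids))
```

A local scheduler could call such a function each time it reschedules; the priority then moves to the next process as soon as the clock crosses a period boundary.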

Hereafter, the outline of AP/Linux will be explained in Section 2; the sched- 
uler to be installed on AP/Linux and its method will be described in Section 3; 
the results of performance evaluation conducted by running actual parallel pro- 
cesses will be presented in Section 4; a comparison of these results with other 
relevant research will be shown in Section 5; and conclusions will be discussed 
in Section 6. 

2 Outline of AP/Linux 

AP/Linux [1] is a parallel operating system for AP1000+ parallel computers. 
The AP1000+ provides an original operating system called CellOS. When 
executing parallel processes, CellOS is loaded along with the executable code. It 
provides only a single-user, single-process environment and does not reside 
permanently on the processors of the AP1000+. AP/Linux is a parallel operating 
system developed by the CAP group of the Australian National University to 
overcome this limitation. 




Fig. 1. Overview of AP1000+. 



Figure 1 shows the overview of the AP1000+. In this computer, a processing 
element is called a cell, and each cell has a SuperSparc processor that 
runs at 50 MHz. As the interface with the outside, the BIF (Broadcast InterFace) 
is connected to the SBUS of the host computer (Sparc Station). Three networks 
connect the cells to one another and to the host computer: T-net, B-net, and S-net. 
T-net, which is a two-dimensional torus network, connects adjacent cells with 
an inter-cell link bandwidth of 25 MB/s, and provides wormhole routing. For 
inter-cell communications, T-net provides message transmission (send, receive) 
and remote memory access (put, get). B-net, which is a broadcast network with 
a bandwidth of 50 MB/s, connects all cells to the host computer. S-net, which is a 
synchronization network, also connects all cells to the host computer. 

Using the bootap+ program, AP/Linux loads the kernel onto each cell from the 
host computer through the BIF, which is connected to B-net, and mounts the disks 
in each cell or the disk on the host computer onto the file system. When the inetd 
daemon is activated on each cell, logging in from external computers becomes 
possible, and an ordinary Linux environment is provided to users logged into the 
cells. Since bootap+ provides a virtual console session to each kernel, bootap+ 
itself cannot terminate while AP/Linux is active. Meanwhile, the environment can 
also be used as an ordinary distributed environment, since TCP/IP can be used on B-net. 

As environments for parallel programming, both the MPI library [4] and the 
AP library, the latter of which is compatible with the library provided 
for the AP1000+, can be used. They are communication libraries with which T-net 
can be used at the user level. Misdelivery of messages sent by these 
communication libraries is unlikely, since messages are tagged, and destination 
processes can be identified after the messages are stored in the ring buffer in the 
receiving cells. Thus, context switching of parallel processes will not affect such 
messages on the network. 

A polling/signal system is implemented in these libraries for detecting messages, 
with which the throughput is improved by context switching into the signal 
waiting mode after waiting for a certain period of time by polling (busy wait). 
When the waiting time is short, however, it is better to wait by polling 
without context switching, because this expedites communications and helps 
avoid frequent context switching (processor thrashing). To conduct polling 
efficiently, all processes of a parallel program need to be active simultaneously. 
To meet this condition, context switching must be performed by co-scheduling the 
parallel processes that are dispersed across the respective cells. 
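
The polling/signal idea can be sketched in a few lines. This is a generic spin-then-block illustration, not the AP or MPI library code: `has_message` and `arrived` are hypothetical stand-ins for the ring-buffer check and the signal, and the spin limit is an arbitrary value.

```python
import threading


def wait_for_message(has_message, arrived, spin_limit=1000):
    """Poll for a message a bounded number of times, then block.

    has_message: callable returning True once a message is available.
    arrived:     threading.Event set by the (hypothetical) signal handler.
    Returns "polled" if the message was found while polling, and
    "blocked" if the caller had to give up the CPU and wait.
    """
    for _ in range(spin_limit):
        if has_message():
            return "polled"  # short wait: no context switch was needed
    arrived.wait()  # long wait: yield the CPU until signalled
    return "blocked"
```

When the communicating processes are co-scheduled, the partner is running at the same time and the reply usually arrives during the polling phase; without co-scheduling the call tends to fall through to the blocking branch.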

The original AP/Linux has a simple parallel process scheduler. Since this 
scheduler does not provide space sharing, parallel processes are always laid out 
starting from Cell0, which causes load concentration. In addition, 
co-scheduling of parallel processes depends on the local scheduler of Cell0, 
which has the process IDs of all parallel processes. In short, with this scheduler, 
Cell0 is subject to a huge load, and space sharing is infeasible. 

3 Parallel Process Scheduling of AP/Linux 

In this paper, the authors propose a scheduling method to be implemented on 
AP/Linux, which provides scheduling by combining space sharing and time 
sharing. 

Space sharing is entrusted to the servers on the host computer. To run parallel 
processes, a request is sent to the server on the host computer, and the server 
determines the cell region where the parallel processes are executed. The creation and 
layout management of parallel processes are described in detail in Sections 3.1 
and 3.2. 

Preemption for time sharing is performed by the Linux kernel 
on the respective cells. Parallel processes can be co-scheduled by controlling the 
priority of the parallel processes in the local scheduler on the respective 
processors. For co-scheduling, priority is controlled 
by giving the ordinary priority only to the parallel process that is the 
target of execution, while lowering the priority of the other parallel processes, which 
are not targets of execution. At present, the priority values 
of the parallel processes that are not targets of execution are determined 
by deducting 15 from their respective priority values. The priority of the target 
parallel process is not raised, to avoid impeding the daemons that are 
indispensable for UNIX and the daemon (paralleld) that creates parallel 
processes. When the target parallel process is not executable in 
the local scheduler, another parallel process is executed instead. This 
method only manages priority, and does not guarantee that a 
parallel process is executed immediately after co-scheduling. If there is 
some other process with a priority higher than that of the parallel processes, the local 
scheduler will give priority to the former. 
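
The priority rule can be illustrated with a toy local scheduler. This is a sketch under assumptions: the penalty of 15 comes from the text, but the data structure, the function names, and the convention that a larger value means higher priority are hypothetical simplifications of the Linux scheduler.

```python
from dataclasses import dataclass

PENALTY = 15  # deducted from non-target parallel processes (from the text)


@dataclass
class Proc:
    pid: int
    base_priority: int
    parallel: bool
    runnable: bool = True


def effective_priority(proc, target_pid):
    """Only parallel processes other than the target are lowered."""
    if proc.parallel and proc.pid != target_pid:
        return proc.base_priority - PENALTY
    return proc.base_priority


def pick_next(procs, target_pid):
    """Toy local scheduler: run the runnable process with the highest
    effective priority (larger value = higher priority, an assumption)."""
    runnable = [p for p in procs if p.runnable]
    if not runnable:
        return None
    return max(runnable, key=lambda p: effective_priority(p, target_pid))


procs = [
    Proc(pid=20001, base_priority=20, parallel=True),   # target
    Proc(pid=20002, base_priority=20, parallel=True),   # other parallel job
    Proc(pid=77, base_priority=25, parallel=False, runnable=False),  # daemon
]
print(pick_next(procs, target_pid=20001).pid)
```

If the target blocks, the other parallel process is chosen despite its lowered priority; if the daemon becomes runnable, it is chosen ahead of both, which is the behaviour that strict gang scheduling rules out.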

Although process thrashing might be a concern, it can be avoided because parallel 
processes are the only processes that require a long CPU time, and they are 
under priority control. Daemon processing by sequential processes is sometimes 
required urgently, but it scarcely affects the parallel processes because 
the time required for the daemon processing is short. In addition, the throughput 
can be improved by giving priority to daemon processing. In other words, since 
the processing of “paralleld,” which creates parallel processes and exists on each 
processor as a daemon, is not hindered by a running parallel process, parallel 
processes can be created promptly. If this processing were hindered by a parallel 
process, the throughput could not be improved because it would take time to create 
new parallel processes. Furthermore, when several coarse grain parallel processes, 
which are large-scale but communicate infrequently, are processed at the same time, 
other parallel processes can be made executable by switching over to the signal 
waiting mode and triggering a context switch, by which the throughput can 
also be improved. 

The method of simultaneously changing the order of priority for co-scheduling 
is described in detail in Section 3.3. 



3.1 Creation of Parallel Processes 

Parallel processes are created and managed by the command and servers as 
shown in Table 1. 

Figure 2 shows an example of executing parallel programs, built with 
the AP library and the MPI library, by issuing the prun command. The 
circles in the figure represent cells, and both the kernel and paralleld exist in 
each cell. The prun command designates the required cell size, given by the -n 
option, and the target parallel process to be executed. The prun command can be 
issued from the host computer or from any other computer that can communicate 
with it by socket. The figure shows an example in which three parallel 
processes are requested. Task1 requests creation in a 2x2 cell group, 
while task2 and task3 request creation in cell groups of 3x3 and 1x3, respectively. 

Table 1. Command and servers for parallel process creation and management. 

name       type       function                      machine 
prun       command    parallel process request      any machine 
pds        server     prun & paralleld management   host 
bootap+    server     allocation management         host 
paralleld  server     create parallel process       each cell 
kernel     interrupt  priority control              each cell 



prun -n 2*2 task1 
prun -n 3*3 task2 
prun -n 1*3 task3 

Fig. 2. Parallel process creation. 



The prun command requests pds (parallel daemon server) on the host 
computer to secure cells and create parallel processes. Pds checks the validity of 
both the requested cell size and the parallel processes. If they are judged valid, pds 
requests a cell layout from bootap+. Once bootap+ has determined the layout 
location, pds requests paralleld on the cells where the parallel processes are 
to be laid out to create the parallel processes. To create parallel processes, 
paralleld uses a modified version of the clone function that is standard 
in Linux. With this clone function, the same process ID (PID) can be 
designated on the respective cells from outside the kernels, so that parallel processes 
can be identified across cells. PIDs of parallel processes are taken from the 
upper half of the PID space, while those of sequential processes use the 
lower half. Pds manages the PIDs of parallel processes. Parallel processes with 
the designated PID are created on the respective cells, and are put into the respective 
local process queues. 
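
The PID partitioning can be sketched as follows. The size of the PID space is not stated in the text, so the value 32768 below is an assumption for illustration, as are the class and function names; the sketch only shows how a central server such as pds could hand out PIDs that never collide with locally assigned sequential PIDs.

```python
PID_SPACE = 32768               # assumed size of the PID space
PARALLEL_BASE = PID_SPACE // 2  # upper half is reserved for parallel processes


def is_parallel_pid(pid):
    """True if the PID lies in the half reserved for parallel processes."""
    return PARALLEL_BASE <= pid < PID_SPACE


class ParallelPidAllocator:
    """Central allocator, playing the role the text assigns to pds."""

    def __init__(self):
        self._in_use = set()

    def allocate(self):
        # Hand out the lowest free PID in the upper half of the space.
        for pid in range(PARALLEL_BASE, PID_SPACE):
            if pid not in self._in_use:
                self._in_use.add(pid)
                return pid
        raise RuntimeError("no free parallel PID")

    def release(self, pid):
        self._in_use.discard(pid)
```

Because the same PID is used on every cell, a cell can tell from the PID alone whether a process belongs to a parallel program and which one.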

Paralleld relays the standard input and output between prun and parallel 
processes. Input from the terminal which is executing prun is transmitted through 
paralleld, and output from parallel processes is also transmitted to the terminal 
through paralleld. 

When a parallel process terminates, the process transmits a termination 
message to paralleld. Upon receipt of this, paralleld disconnects the standard 
input/output session between prun and the parallel process, then sends a 
parallel process termination message to pds. Upon receipt of this message, pds 
notifies bootap+ of the termination, and releases the region to which the parallel 
process was allocated. 

3.2 Layout of Parallel Processes 





Fig. 3. Combination of time sharing and space sharing. 



The layout of parallel processes is managed by a method proposed by the
authors [6], which combines partitioning with virtual parallel computers.
Virtual parallel computers with the same interconnection as the actual parallel
computer are prepared for time sharing, while space sharing is conducted on
each virtual parallel computer using a partitioning algorithm. Such a
virtual parallel computer is called a slice. Extra slices are prepared for
the parallel processes that cannot be allocated to existing slices. For
time sharing, the actual parallel computer is allocated to these slices
for a fixed time each by round robin. This kind of time sharing improves the
efficiency of processor use: one slice can be prepared for a parallel
process that uses the entire group of processors, while other slices are
allocated to parallel processes that require only a few processors.
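
The combination of slices and partitioning can be sketched as follows. This is a minimal illustration under stated assumptions: a plain first-fit scan stands in for the partitioning algorithm, the 8x2 mesh size is taken from the evaluation section, and all names are hypothetical.

```python
# Illustrative sketch: a first-fit scan stands in for the real partitioning algorithm.
from typing import List, Optional, Tuple

WIDTH, HEIGHT = 8, 2  # assumed mesh size (the 16-cell machine used in Section 4)


def new_slice() -> List[List[Optional[int]]]:
    """An empty slice: one entry per cell, None meaning the cell is free."""
    return [[None] * WIDTH for _ in range(HEIGHT)]


def place(slice_, job: int, w: int, h: int) -> Optional[Tuple[int, int]]:
    """Place a w x h job at the first free position in the slice, scanning row by row."""
    for y in range(HEIGHT - h + 1):
        for x in range(WIDTH - w + 1):
            if all(slice_[y + dy][x + dx] is None
                   for dy in range(h) for dx in range(w)):
                for dy in range(h):
                    for dx in range(w):
                        slice_[y + dy][x + dx] = job
                return (x, y)
    return None


def allocate(slices, job: int, w: int, h: int) -> int:
    """Return the index of the slice that received the job, creating a slice if needed."""
    if w > WIDTH or h > HEIGHT:
        raise ValueError("job does not fit on the machine")
    for i, s in enumerate(slices):
        if place(s, job, w, h) is not None:
            return i
    slices.append(new_slice())
    place(slices[-1], job, w, h)
    return len(slices) - 1


def active_slice(slices, tick: int) -> int:
    """Round robin: the slice that owns the real machine during time quantum `tick`."""
    return tick % len(slices)


slices = []
placement = [allocate(slices, job, 4, 2) for job in range(3)]  # three 4x2 jobs
```

With three 4x2 jobs, the first two share one slice (space sharing) and the third forces a second slice, which then alternates with the first (time sharing).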

Figure 3 shows the state of slices in time sharing. Rectangles on the
respective slices represent the allocation of parallel processes. The rate of
processor utilization within one slice can be raised by partitioning
algorithms. In addition, parallel processes that can be allocated to several
slices are placed in all of those slices, which further increases
the rate of processor utilization. A parallel process that exists in only one slice
is called a single task. A parallel process that can exist in more than one slice is
called a multiple task. In Figure 3, the rectangles colored in light gray represent
single tasks, while rectangles colored in dark gray represent multiple tasks. To
improve response, slices that contain no single task are eliminated.

Bootap+ performs the layout of parallel processes. Since bootap+ occupies the
BIF while AP/Linux is in operation, bootap+ also needs to be able to process
requests by interrupt.

The AP1000+ has a two-dimensional torus network. AP/Linux, however, treats it
as a mesh, for which many partitioning algorithms have been proposed. As the
partitioning algorithm, Adaptive Scan [7] was adopted because of its high
efficiency.



3.3 Synchronization 




Fig. 4. Internal synchronization. 



To control the priority of parallel processes on slices simultaneously, the
authors propose a method of internal synchronization that takes advantage of
the hardware properties of the AP1000+. Internal synchronization uses a clock
whose time synchronization is physically guaranteed. Since the
clocks inside the respective cells tick every 80 ns, synchronization can
be conducted without relying upon the host computer.

Figure 4 shows the schematic diagram of internal synchronization. With
internal synchronization, the kernels hold slice queues containing slice
information in addition to the ordinary process queues. Parallel processes
are laid out by bootap+ on the host computer. Every time layout
is conducted, bootap+ interrupts the respective cells to update the slice queue
information.

Regarding the time adjustment for co-scheduling, Cell0 acts as the
representative and notifies all cells of the present time by interrupt when
slices are created. This time serves as the standard time, and the priority of
parallel processes is switched every time the time allocated to each parallel
process has elapsed. Every time re-scheduling is conducted, each cell checks
whether the time for changing the priority of parallel processes has elapsed.
If it has, the priority of the parallel process that has been
executing is lowered, and the priority of the next parallel process to be executed
is returned to normal.
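
The priority switch performed at each re-scheduling can be sketched as below. This is only an illustration of the idea: the class, the two priority levels, and the millisecond time base are assumptions, and the 200 ms quantum is the value used later in the evaluation. Because the slice to run is derived from the shared standard time, a cell that misses one or more switching points still arrives at the correct slice.

```python
# Illustrative sketch: names and the priority values are assumed.
QUANTUM_MS = 200  # switching interval used in the evaluation (Section 4)


def slice_for_time(now_ms: int, base_ms: int, n_slices: int,
                   quantum_ms: int = QUANTUM_MS) -> int:
    """Index of the slice that should be running at local time now_ms."""
    elapsed = now_ms - base_ms
    return (elapsed // quantum_ms) % n_slices


class Cell:
    """Per-cell state consulted at every re-scheduling."""

    NORMAL, LOWERED = 0, 1

    def __init__(self, slice_queue, base_ms: int) -> None:
        self.slice_queue = slice_queue  # one process name per slice on this cell
        self.base_ms = base_ms          # standard time announced by Cell0
        self.current = 0
        self.priority = {p: self.LOWERED for p in slice_queue}
        self.priority[slice_queue[0]] = self.NORMAL

    def reschedule(self, now_ms: int) -> str:
        """Switch priorities if the quantum has elapsed; return the favoured process."""
        target = slice_for_time(now_ms, self.base_ms, len(self.slice_queue))
        if target != self.current:
            self.priority[self.slice_queue[self.current]] = self.LOWERED
            self.priority[self.slice_queue[target]] = self.NORMAL
            self.current = target
        return self.slice_queue[self.current]


cell = Cell(["A", "B", "C"], base_ms=1000)
```

Calling `cell.reschedule(now_ms)` with increasing times walks through the slice queue in round-robin order, one slice per 200 ms.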

When a parallel process finishes, it sends a termination message to
paralleld. Upon receipt of this message, paralleld disconnects the standard
input/output session between prun and the parallel process, then sends a
parallel process termination message to pds. Upon receipt
of this message, pds notifies bootap+. Bootap+ releases the
realm of the parallel process, and passes the updated slice information
to the kernels of the respective cells by interrupt. Based on this
information, the kernels update the slice queue information.

The time interval for switching the priority is set to be sufficiently longer
than the time required for re-scheduling. If re-scheduling fails to take place
for some reason, the parallel process that should be executing at that point is
calculated from the slice queue, and the cell switches over to that parallel
process.



Problems There is one problem with internal synchronization: co-scheduling
skew may occur because re-scheduling does not occur simultaneously on the
respective processors. Nevertheless, the influence of the co-scheduling skew is
considered to be small, because the only processes that require processing,
apart from daemons, are parallel processes, and because parallel processes are
controlled by priority switching. The extent of the co-scheduling skew is
determined by measurement on the actual implementation.



4 Performance Evaluation 

To evaluate the implemented scheduling, parallel processes were submitted and
their execution performance was measured. An AP1000+ with 16 cells (8x2) was
used for the performance evaluation, and the priority of parallel processes was
set to be switched every 200 ms. This co-scheduling period is comparable
to that of SCore-D [8] (a period of 50 ms to 200 ms on a 200 MHz Pentium Pro),
and it is long enough to hide the overhead, as discussed in Section 4.3.

The focus of the measurements was to compare AP/Linux with CellOS, and
to identify the efficiency when executing several fine grain parallel processes
that wait for communication in the busy wait mode, and coarse grain parallel
processes that wait for communication in the signal mode.

Unfortunately, we did not have suitable benchmark programs for comparing fine
and coarse grain parallel processes, so we wrote our own programs to check the
performance.

4.1 Comparison with CellOS 

First of all, one parallel process was executed to compare AP/Linux with CellOS.

A program with the message patterns "pipelined round robin" and "one to
all" was adopted as the fine grain parallel process for the performance
evaluation. "Pipelined round robin" passes data from cell to cell, like
a pipeline, around all cells (8x2). "One to all" sends a message from one
cell to all other cells ((8x2)-1), and the receivers of the message return
answerbacks without any particular processing. Both patterns transmit data of
10 bytes.
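
As a rough illustration of the two patterns, the sketch below lists the messages exchanged in one iteration of each. The cell numbering, the choice of cell 0 as the sender, and the assumption that one lap of the round robin returns the data to its origin are all illustrative; the original test program is not shown in the paper.

```python
# Illustrative sketch: cell numbering and function names are assumed.
from typing import List, Tuple

N_CELLS = 8 * 2      # the 8x2 configuration used in the evaluation
MESSAGE_BYTES = 10   # payload size of both fine grain patterns


def pipelined_round_robin(n_cells: int = N_CELLS) -> List[Tuple[int, int]]:
    """One lap: each cell forwards the data to its successor, the last back to cell 0."""
    return [(src, (src + 1) % n_cells) for src in range(n_cells)]


def one_to_all(root: int = 0, n_cells: int = N_CELLS) -> List[Tuple[int, int]]:
    """One iteration: root sends to every other cell, then each of them answers back."""
    others = [c for c in range(n_cells) if c != root]
    sends = [(root, dst) for dst in others]
    answerbacks = [(src, root) for src in others]
    return sends + answerbacks


ring = pipelined_round_robin()
star = one_to_all()
print(len(ring), "messages per lap,", len(star), "messages per one-to-all iteration")
```

Under these assumptions one lap of the round robin takes 16 messages, and one "one to all" iteration takes 15 sends plus 15 answerbacks.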

Figure 5 shows the results of executing a parallel process with both "pipelined
round robin" and "one to all" using CellOS and AP/Linux, respectively. The
parallel process was executed five times in each case to confirm that the
operation was stable. In Figure 5, the X axis represents the number of
iterations of data transmission, while the Y axis is the time. Since the number
of iterations was varied in powers of 4, both the X axis and the Y axis
use a log scale. Figure 5 shows two times: the time from the
submission of the process on the host computer until the
termination of processing (total processing time, indicated as "total" in
the figure), and the time from the start of actual processing
on the AP1000+ until termination (process processing time, indicated as
"process" in the figure). The difference between the total processing time and
the process processing time is the overhead of preprocessing and
postprocessing by the OS. The overhead of co-scheduling by AP/Linux is included
in the process processing time.

Figure 5 shows that there is no major difference between CellOS and
AP/Linux in the process processing time required by the parallel
process itself. This holds for both "pipelined round robin" and "one to all";
the communication pattern made no difference to the performance.
Although the overhead of co-scheduling is included in the process processing time
of AP/Linux, it is negligible in comparison with the execution
by CellOS. Thus, these results show that the execution performance does not
change greatly even if AP/Linux is used instead of CellOS. AP/Linux is even
superior to CellOS in terms of usability, considering the overhead that arises when
starting CellOS and the capability of AP/Linux to execute several processes
concurrently.

The total processing time, from the submission of the parallel process on the
host computer until the termination of processing, differs greatly between CellOS
and AP/Linux. This is attributable to the difference in the overhead of
preprocessing for starting the parallel process and of postprocessing. This
overhead is almost the same regardless of the number of
iterations of data transmission, although this may not be clear in the figure
because the values are shown using the log scale. Table 2 summarizes the time
required for the overhead, which was about 5.4 seconds with CellOS and about
0.8 seconds with AP/Linux. The difference is attributable to the difference in the
method of running parallel processes. With CellOS, both the executable
code of the parallel process and CellOS itself are transmitted to the respective
cells when parallel processing starts, and the respective states need to be
initialized for executing a parallel process. With AP/Linux, since the executable
code of the parallel process is managed by demand paging, it is not
loaded from the disk if it is already in memory. Since the processing
was conducted several times in the measurement, the time required for loading
the code from the disk was eliminated. In addition, since a parallel process
on AP/Linux can use dynamic libraries, there is no need to load the library
code itself, although there is the overhead of dynamic linking. AP/Linux, which
is equipped with these functions of UNIX, is advantageous when executing several
parallel processes.

Table 2. Overhead to run a parallel process.

              pipelined round robin    one to all
              Average (sec)            Average (sec)
  CellOS      5.38                     5.39
  AP/Linux    0.83                     0.77

4.2 Processing of Several Parallel Processes 

The effects of space sharing/time sharing when executing several fine grain par- 
allel processes and the effect of moderate co-scheduling when executing coarse 
grain parallel processes are shown below. 



Effect of Space Sharing/Time Sharing Several fine grain parallel processes
as described before were submitted with varying cell realms. Only the
case of "one to all" is discussed here, since the parallel processes executed
with "pipelined round robin" and "one to all" showed the same trend. Figure
6 and Figure 7 show the results of execution for the two demanded cell
realms, 2x2 and 4x2 respectively. In Figure 6 and Figure 7, the X axis
represents the number of processes, while the Y axis shows the total processing
time. Parallel processes were executed 5 times in each case to confirm that the
operation was stable. Both graphs show that space sharing and time sharing
are effective. Parallel processes that demand a cell realm of 2x2 can be laid out
spatially at 4 parallel processes per slice, so the total processing
time increases in steps of 4 processes. In the same manner, the total processing
time of parallel processes that demand a cell realm of 4x2 increases in steps
of 2 processes.
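
The step pattern follows from a short calculation, sketched below under the assumption that all jobs request the same realm and pack perfectly into the 16-cell machine. The function names are illustrative.

```python
# Illustrative sketch: assumes every job requests the same realm and packs perfectly.
import math

MACHINE_CELLS = 8 * 2


def jobs_per_slice(w: int, h: int, machine_cells: int = MACHINE_CELLS) -> int:
    """How many w x h realms fit side by side in one slice."""
    return machine_cells // (w * h)


def slices_needed(n_jobs: int, w: int, h: int) -> int:
    """Number of slices required to hold n_jobs identical jobs."""
    return math.ceil(n_jobs / jobs_per_slice(w, h))


for w, h in [(2, 2), (4, 2)]:
    print((w, h), [slices_needed(n, w, h) for n in range(1, 8)])
# (2, 2) [1, 1, 1, 1, 2, 2, 2]
# (4, 2) [1, 1, 2, 2, 3, 3, 4]
```

Since the slices take turns on the real machine, the total processing time is expected to grow roughly with the number of slices rather than with the number of processes.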

Fig. 7. Effect of space sharing and time sharing for one to all (4x2).

Table 3 tabulates the values of Figure 6 and Figure 7. The tables show, for
each number of processes, the number of slices allocated, the mean
total processing time, and the elongation relative to the total processing
time of a single process. The tables show that the total processing time
increases only slightly when the number of parallel processes that can be
executed concurrently is increased by space sharing without changing the number
of slices. The increase in the total processing time is attributable to the delay in
creating processes and the slight elongation of the time required for processing
the respective parallel processes. However, since this increase is slight in
comparison with the time required for processing the parallel processes, it is
negligible and is outweighed by the effect of space sharing.

The relative elongation of the total processing time is roughly
proportional to the number of slices. In the results for 4x2 (Table 3) the
elongation is even slightly less than the number of slices (for example, 1.93
for two slices), because the management of parallel processes on the host
computer can be overlapped with the execution of parallel processes.
This shows that the parallel processes on other slices are not obstructed
even when there are several slices. This result also shows that the influence of
co-scheduling is small. The performance of co-scheduling is described in
Section 4.3.



Table 3. Result of one to all.

(2x2)
  slice   process   sec     times
  1       1         2.78    1.00
          2         2.95    1.06
          3         3.13    1.13
          4         3.38    1.22
  2       5         5.49    1.97
          6         5.63    2.03
          7         5.86    2.11

(4x2)
  slice   process   sec     times
  1       1         5.05    1.00
          2         5.28    1.05
  2       3         9.73    1.93
          4         9.85    1.95
  3       5         14.34   2.84
          6         14.47   2.87
  4       7         19.11   3.78

Effect of Moderate Co-Scheduling With AP/Linux, the time allocated to
parallel processes is not strictly reserved, unlike the case with gang scheduling.
Accordingly, when the parallel process being executed starts waiting for I/O, it
is possible to move to another process by context switching. To confirm this, the
execution of a coarse grain parallel process, which switches to
signal waiting instead of busy waiting, was studied. In an environment
where several parallel processes are running, the processing of one parallel process
is likely to be overlapped with that of other parallel processes.

As an example of coarse grain, a parallel process that makes a message of
2,000 bytes travel 1,000 times around all cells (8x2) by round robin was used.
Figure 8 shows the results of the total processing time. For reference, the results
of execution using CellOS are also shown in the figure. Here, the overhead
required for starting CellOS is excluded. With CellOS, a process waits for
communication by busy waiting. When the 2,000-byte round robin is run as a
single process, the processing time with busy waiting was almost equivalent to
that with switching over to signal waiting.

Fig. 8. Elapsed time of the round robin that transfers 2,000 bytes. (Plot
legend: CellOS, AP/Linux; X axis: processes.)



Although the effect of moderate co-scheduling is not apparent when only a 
single parallel process is executed, the effect is evident when several parallel pro- 
cesses are executed. Despite the overhead of context switching to signal waiting, 
the entire processing time is not greatly increased since other processes can be 
executed meanwhile. Figure 8 shows that the total processing time does not in- 
crease greatly because of the overlapped execution of several parallel processes. 
With strict gang scheduling, the effect of such overlapping cannot be gained 
because other processes cannot be processed during the time allocated to one 
parallel process. 

Figure 9 shows the speed of creating several parallel processes. This figure
shows the difference between the time when the first parallel process is started
and the time when the last parallel process is started, when several parallel
processes are submitted. The difference in the starting time increased almost
linearly as the number of submitted processes increased. Process creation is
subject to the influence of local scheduling because it depends upon the
scheduling of paralleld. If the scheduling of a parallel process is delayed,
another parallel process is executed. Since the parallel process that missed
its chance must wait until its paralleld becomes the next target of scheduling,
the difference in the starting time is large in some cases. There are cases
where the time for creating two or three processes is almost equivalent, which
is because paralleld received the requests for parallel process creation
consecutively during execution. Figure 9 shows that the speed of parallel
process creation is not severely affected, which suggests that the creation
time is acceptable for a scheduling method that conforms to
the style of UNIX.



Fig. 9. Speed of parallel process creation. (X axis: processes.)



4.3 Performance of Synchronization 

The co-scheduling skew with internal synchronization averaged about 10 ms with
a standard deviation of around 5 ms. Although this average
skew appears large for a switching interval of 200 ms, it does not greatly
affect the scheduling itself, because it behaves more like a constant time
shift than like skew. The real concern is the standard deviation, but it
is considered allowable as a co-scheduling skew, since 5 ms accounts for 2.5% of
the 200 ms interval at which the priority is switched.

On the other hand, the time required for interrupt processing in the
respective kernels was about 0.3 ms for 2 processes and about 0.4 ms for 8
processes. Although this is added to parallel processing as overhead every time,
it is considered negligible because it accounts for well under 1% of 200 ms
(0.4 ms / 200 ms = 0.2%). It is
also negligible when compared with the time lag of synchronization.

5 Related Research

Implicit co-scheduling [3] and dynamic co-scheduling [9, 10] resemble our
moderate co-scheduling. Implicit co-scheduling and dynamic co-scheduling are
demand-based co-scheduling, that is, they rely on a local scheduler to run the
parallel process when it receives a message. They assume that the scheduler
quickly dispatches parallel processes after receiving a message. We believe our
local scheduler cannot dispatch so quickly. Therefore we provide
multi-processor-wide co-scheduling, which is used to synchronize
the scheduling of parallel processes on all processors. It is beneficial for fine grain
parallel processes, because it raises the probability that the processes of a
parallel job run on their allocated processors at the same time, so that little
time is lost in receiving frequent messages. Furthermore, moderate co-scheduling,
unlike gang scheduling, allows a parallel process to yield its processor time
when it must wait for a long time. This results in high throughput of parallel
processes.

6 Conclusions 

In this paper, the authors proposed a scheduling method that combines space
sharing and time sharing for AP/Linux, which is an OS for parallel
computers. The scheduling is in principle moderate co-scheduling, in which
the order of priority of specific parallel processes is switched at fixed
intervals. In addition, as a method to change the order of priority of parallel
processes simultaneously for conducting co-scheduling, the authors proposed and
implemented internal synchronization, which takes advantage of the synchronous
clock that is a hardware property of the AP1000+.

Together with the communication library for polling/signal switching, this
method allowed efficient execution of both fine grain parallel processes and coarse
grain parallel processes. It was confirmed that co-scheduling allows efficient
operation of fine grain parallel processes, which wait
for communication by busy wait, even when several of them are submitted. In
addition, it was confirmed that the overall processing efficiency of coarse grain
parallel processes, which wait for communication by signal, could be improved
by context switching to execute another parallel process.

Although the problem with internal synchronization is the co-scheduling skew 
caused by the failure of synchronization in the respective processors when con- 
ducting re-scheduling, it was concluded that the time lag was negligible since it 
accounted for around 2.5% of 200 ms where priority switching takes place. 

An important issue for a parallel OS is to provide both a communication
library and a scheduler that take into account the properties of applications
and the structure of the available hardware. A parallel OS designer must decide
whether to use busy waiting or context switching based on the communication
patterns of the applications. If co-scheduling can be implemented efficiently
on the available hardware, it is worth introducing for fine grain parallel
processes. At present, AP/Linux handles both fine grain and coarse
grain parallel processes, thus allowing efficient processing of a wide variety of
application programs.

Abstract. This paper presents some ideas for efficiently allocating
resources to enhance the performance of gang scheduling. We first
introduce a job re-packing scheme. In this scheme we try to rearrange the order
of job execution on their originally allocated processors in a scheduling
round to combine small fragments of available processors from
different time slots together to form a larger and more useful one in a single
time slot. We then describe an efficient resource allocation scheme based
on job re-packing. Using this allocation scheme we are able to decrease
the cost for detecting available resources when allocating processors and
time to each given job, to reduce the average number of time slots per
scheduling round and also to balance the workload across the processors.



1 Introduction 

With the rapid developments in both hardware and software technology the 
performance of scalable systems such as clusters of workstations/PCs/SMPs has 
significantly been improved. It is expected that this kind of system will dominate 
the parallel computer market in the near future because of the continued cost- 
effective growth in performance. For this type of machine to be truly utilised as 
general-purpose high-performance computing servers for various kinds of appli- 
cations, effective job scheduling facilities have to be developed to achieve high 
efficiency of resource utilisation. 

It is known that coordinated scheduling of parallel jobs across the proces- 
sors is a critical factor to achieve efficient parallel execution in a time-shared 
environment. Currently the most popular scheme for coordinated scheduling is 
explicit coscheduling [4], or gang scheduling [3]. With gang scheduling, processes
of the same job run simultaneously for only a certain amount of time, which
is called a scheduling slot. When a scheduling slot ends, the processors
context-switch at the same time to give service to the processes of another job.
All parallel jobs in the system take turns to receive service in a coordinated
manner.

Because there are multiple processors in a system, the resource allocation
will include both space partitioning and time sharing in the gang scheduling 
context. One disadvantage associated with the conventional gang scheduling for 
clustered (or networked) computing systems is its purely centralised control for 
context-switches across the processors, that is, a central controller is used to fre- 
quently broadcast messages to all the processors telling which job should obtain 
the service next. When the size of a system is large, efficient space partitioning 
policies are not easily incorporated mainly due to this frequent signal broad- 
casting. Currently most allocation schemes for gang scheduling only consider 
processor allocation within the same time slot and the allocation in one time 
slot is independent of the allocation in other time slots. To ensure a high effi- 
ciency of resource utilisation, however, we believe that allocation of resources in 
both time and space has to be considered simultaneously. 

We have designed a new coscheduling scheme called loose gang scheduling, or 
scalable gang scheduling [7,8]. Using our scheduling scheme the disadvantages as- 
sociated with conventional gang scheduling are significantly alleviated, especially 
the requirement for frequent signal-broadcasting. The basic structure of this 
scheduling scheme has been implemented on a 16-processor Fujitsu AP1000+.
Although the function of the current coscheduling system is limited and needs to 
be further enhanced, the preliminary experimental results show that the scheme 
works as expected [6]. This enables us to consider the allocation of resources in 
both space and time at the same time in a more effective way to significantly 
enhance the performance of gang scheduling. 

In this paper we present some resource allocation schemes for achieving high 
system and job performance. We first give some simple examples in Section 2 to 
show that regularity is still an important factor in designing resource allocation 
policies even for a parallel system with an unstructured interconnection or fully 
connected pattern between processors, and that resource allocation in different 
time slots should be considered at the same time to achieve a higher efficiency 
in resource utilisation. It is known that allocation schemes which take regularity 
into account can cause the problem of fragmentation. In Section 3 we introduce 
a job re-packing scheme. Using this scheme small fragments in different time 
slots can under certain conditions be combined together to form a much larger 
and useful one in a single time slot. Based on job re-packing we then describe 
in Section 4 a simple and practical resource allocation scheme which considers 
the workload conditions in both space and time dimensions at the same time 
(rather than just in each individual time slot) when allocating resources to a 
given job. With this scheme we are able to reduce the average number of time 
slots per scheduling round, to balance workloads across the processors and thus 
to achieve high system and job performance. 



2 Motivation 

To design efficient processor allocation policies for conventional distributed- 
memory MPPs we should consider two important factors, regularity and locality, 
in order to minimise communication costs and to avoid communication
contention between different jobs. On certain parallel machines such as clusters of
workstations or PCs or SMPs interconnected via a Gigabit Ethernet or a cross- 
bar switch network, however, the communication costs may not depend on the 
location of processors. Thus regularity and locality may become less important 
issues when only space partitioning is considered. In this case a simple random 
allocation scheme may be preferable, that is, we can arbitrarily choose a set of
available processors for a given job regardless of their localities. With this simple
scheme we may alleviate the problem of fragmentation caused by those allocation 
schemes taking regularity into consideration. 

The question is if the simple random allocation scheme can also be incor- 
porated in gang scheduling to efficiently allocate resources in each time slot. It 
seems that the answer to this question is positive. When a new job arrives, we 
first search to see if there are enough available processors in an existing time slot. 
If there are, a set of available processors regardless of their localities is allocated 
to that job. A new time slot will be created if we cannot find enough available 
resources in any existing time slot. 
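
The procedure just described can be sketched as follows. This is a minimal illustration, not the authors' implementation: the representation of a time slot as a mapping from job to processor set, the 8-processor machine, and the function name are all assumptions.

```python
# Illustrative sketch: data layout and names are assumed.
import random
from typing import Dict, List, Optional, Set

N_PROCESSORS = 8


def allocate_random(slots: List[Dict[int, Set[int]]], job: int, size: int,
                    rng: Optional[random.Random] = None) -> int:
    """Give `job` any `size` free processors in the first slot that has enough.

    Each slot maps a job id to the set of processors it occupies. A new time
    slot is appended when no existing slot has enough free processors.
    Returns the index of the slot used.
    """
    rng = rng or random.Random()
    if size > N_PROCESSORS:
        raise ValueError("job is larger than the machine")
    for index, slot in enumerate(slots):
        busy = set().union(*slot.values())
        free = sorted(set(range(N_PROCESSORS)) - busy)
        if len(free) >= size:
            slot[job] = set(rng.sample(free, size))
            return index
    slots.append({job: set(rng.sample(range(N_PROCESSORS), size))})
    return len(slots) - 1


slots: List[Dict[int, Set[int]]] = []
rng = random.Random(0)
used = [allocate_random(slots, job, size, rng)
        for job, size in enumerate([5, 3, 4, 2])]
```

Jobs of sizes 5 and 3 fill the first slot, so the jobs of sizes 4 and 2 go into a second slot, each on an arbitrary set of processors.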

Fig. 1. One possible situation caused by using the random allocation scheme.
(Diagram: a few jobs scattered over processors P1 to P8 in different time slots.)



The problem, however, lies not in how to allocate resources in
each time slot to new jobs, but in how to effectively reuse the resources
freed by job termination when there are no new arrivals. Let us consider
a simple example. Originally the system was very busy and four time slots had
to be created. After a certain period of time the small jobs had all terminated,
and only a few large jobs were left, scattered across the processors in
different time slots, as shown in Fig. 1. We know that the processes of the same
parallel job need coordination during the computation. Because processors are
arbitrarily allocated to jobs in each time slot, one possible situation, as
depicted in Fig. 1, is that the total number of time slots cannot be reduced,
nor can the freed resources be efficiently reallocated to the running jobs, even
though the processors in the system are active less than fifty percent of the
time on average. The simulation results presented in [1] show that this
situation can be improved significantly if a well known regular allocation
strategy such as first fit or buddy is adopted instead. The main reason is that,
when regularity is taken into account, the number of unification counts can be
greatly increased, jobs have better chances to run in multiple time slots, and
thus both the performance of parallel jobs and the efficiency of resource
utilisation can be enhanced.
(The number of unification counts is defined as the number of times the same
set of processors in two time slots are united and allocated to a single parallel
job [1], and a parallel job running in more than one time slot is sometimes called
a multi-slice job [5].)
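
The idea of slot unification can be illustrated with a small sketch. The data layout, the job names, and the greedy order in which jobs are examined are assumptions; the sketch only shows why regular placement makes unification possible where scattered placement does not.

```python
# Illustrative sketch: data layout and names are assumed.
from typing import Dict, List, Set, Tuple

Slot = Dict[str, Set[int]]  # job name -> processors it occupies in this time slot


def free_processors(slot: Slot, n_processors: int) -> Set[int]:
    """Processors not used by any job in this time slot."""
    busy = set().union(*slot.values())
    return set(range(n_processors)) - busy


def unify(slots: List[Slot], n_processors: int) -> List[Tuple[str, int]]:
    """Let each job also run in every other slot where its processors are all free.

    Returns the (job, extra_slot) pairs that were added; their number is the
    unification count in this sketch.
    """
    added = []
    for home, slot in enumerate(slots):
        for job, procs in list(slot.items()):
            for other, target in enumerate(slots):
                if other == home or job in target:
                    continue
                if procs <= free_processors(target, n_processors):
                    target[job] = set(procs)
                    added.append((job, other))
    return added


# Regular (contiguous) placement: J1 on P0-P3 in slot 0, J2 on P4-P7 in slot 1.
regular = [{"J1": {0, 1, 2, 3}}, {"J2": {4, 5, 6, 7}}]
# Scattered placement of the same two jobs: their processor sets overlap.
scattered = [{"J1": {0, 2, 4, 6}}, {"J2": {1, 2, 5, 6}}]
```

In the regular case each job can also run in the other slot, so both become multi-slice jobs; in the scattered case neither job finds its processors free in the other slot.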

It can be seen from the above discussion that to design efficient job allo- 
cation strategies for gang scheduling regularity is an important factor. Many 
existing schemes for space partitioning take regularity into consideration. By 
simply adopting one of such schemes and allocating resources independently in 
each time slot as we conventionally do, however, efficient resource utilisation 
may still not be guaranteed. 

First, regular allocation schemes can have the problem of small fragments,
that is, a fragment of available processors in a time slot may be too small to be
allocated to any job. The second main problem associated with the conventional
allocation method is that space partitioning and time sharing are not considered
simultaneously. Consider two space partitioning schemes, first fit
and buddy, which are widely used in gang scheduling for resource allocation
in each time slot. Although the internal fragmentation caused by the buddy
allocation scheme may be more serious than the external fragmentation caused
by first fit, simulation results show that the DHC allocation scheme [2],
which is based on buddy, performs better than those based on other regular
allocation schemes if slot unification is allowed [1]. The main reason may be as
follows. With the buddy allocation scheme, processors are repeatedly divided into
two subsets of equal size. If the same policy is applied to every time slot, two
jobs allocated to different subsets of processors, whether in the same time slot
or not, will never overlap with each other. When a job terminates, the freed
resources in one time slot are more likely to be united with those in another time
slot and then reallocated to a single job. Thus space partitioning and time sharing
are implicitly (though not thoroughly) considered at the same time. However,
since the allocation of resources is essentially independent in different time
slots and small internal fragments cannot be grouped together, it may still be
difficult to achieve optimal resource utilisation with a simple buddy
based allocation scheme.



3 Job Re-packing 

We cannot totally avoid the problem of fragmentation in each time slot when
regularity is taken into consideration. However, we have to adopt regular
allocation schemes in order to achieve better system performance, as discussed
in the previous section. The question is thus whether we can find a way to
alleviate the fragmentation problem. In this section we introduce a job
re-packing scheme. Using this scheme we are able to combine certain small
fragments from different time slots into a larger and more useful one in a
single time slot.

In the following discussion we assume that processors in a parallel system
are logically organised as a one-dimensional linear array. Note that the term
one-dimensional linear array is defined purely in the gang scheduling context.
A logical one-dimensional array is defined as a set of N processors which are
enumerated from 1 to N (or from 0 to N - 1) regardless of their physical
locations in the system. Thus we can simply use a two-dimensional global
scheduling matrix such as the one in Fig. 1. By the term linear array we mean
that only consecutively numbered processors can be allocated to a given job.
Thus regularity is associated only with the global scheduling matrix, not with
the physical locations of processors.



[Figure 2, panels (a) to (d): each panel shows the global scheduling matrix,
with time slots S1 to S3 as rows and processors P1 to P8 as columns.]

Fig. 2. Job re-packing to reduce the total number of time slots in a scheduling
round.

We first give a simple example, shown in Fig. 2, to demonstrate the basic
ideas of our re-packing scheme. In this example the system has eight processors
and originally three slots are created to handle the execution of nine jobs. Now
assume that two jobs J' and J" in slot S2 are terminated. Using the idea of
unification, jobs J1 and J2 in slot S1 (or J5 in S3) and job J7 in S3 can
certainly occupy the freed resources in S2 to become multi-slice jobs. This
might be the best assignment if the only requirement were to enhance the
performance of these jobs. However, the assignment may not be optimal from the
point of view of overall system performance, because some small fragments remain
unutilised and other running jobs in the system cannot obtain any benefit.
Things become worse if a new job arrives which requires six or more processors:
a fourth time slot has to be created and the performance of the existing jobs
will then be degraded.

With a certain rearrangement, or re-packing, of jobs, however, we can eliminate
one time slot in the same situation. The procedure is as follows: First we
reallocate jobs J1 and J2 from slot S1 to slot S2, as shown in Fig. 2(b). After
this step two fragments in S1 and S2 are combined into a larger one in slot S1.
This new fragment can further be combined with the one in slot S3 by taking
jobs J5 and J6 down to slot S1, as shown in Fig. 2(c). Finally we simply bring
job J7 down to slot S2. Slot S3 now becomes empty and can then be removed.
It is obvious that this type of job re-packing can greatly improve the overall
system performance. Note that during the re-packing only processes of the same
job are shifted from one time slot to another. Therefore, this kind of
re-packing merely rearranges the order of job execution on the originally
allocated processors within a scheduling round, and no process migration between
processors is involved.

Because processes of the same job need to coordinate with each other
during the computation, all processes of the same job must be shifted together
to the same time slot at the same time. This restricts how jobs can be shifted
during the re-packing. In Fig. 2(a), for example, the processes of J4 in S2
cannot be shifted to either S1 or S3 because the fragment in either slot is not
big enough to accommodate all the processes of J4. A shift is said to be legal
if all processes of a job are shifted to the same slot at the same time. In job
re-packing we use this kind of legal shift to rearrange jobs between time
slots so that small fragments of available processors in different time slots
can be combined into a larger and more useful one.
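
The legality test can be made concrete with a minimal sketch. It assumes (our
choice, not something prescribed above) that the global scheduling matrix is
stored as a list of time slots, each a list with one entry per processor that
holds a job name, or None for an idle processor. The function name shift_job is
hypothetical. The example data reproduce Fig. 2(a) after J' and J" have
terminated.

```python
from typing import List, Optional

# One time slot (row of the scheduling matrix): entry p names the job running
# on processor p in that slot, or is None if the processor is idle there.
Slot = List[Optional[str]]


def shift_job(matrix: List[Slot], job: str, src: int, dst: int) -> bool:
    """Legally shift `job` from slot `src` to slot `dst`.

    All processes move together and stay on their own processors. Returns
    False, leaving the matrix untouched, if the job is not in `src` or if any
    of its processors is busy in `dst`.
    """
    cols = [p for p, owner in enumerate(matrix[src]) if owner == job]
    if not cols or any(matrix[dst][p] is not None for p in cols):
        return False
    for p in cols:
        matrix[dst][p] = job
        matrix[src][p] = None
    return True


# Fig. 2(a) after J' and J" have terminated (index 0 is S1, index 2 is S3).
S1, S2, S3 = 0, 1, 2
matrix = [
    ["J1", "J1", "J2", "J2", None, "J3", "J3", "J3"],
    [None, None, None, None, "J4", "J4", None, None],
    ["J5", "J5", "J5", "J6", "J6", None, "J7", "J7"],
]

# J4 cannot move: P6 is busy in S1 and P5 is busy in S3.
j4_moves = [shift_job(matrix, "J4", S2, S1), shift_job(matrix, "J4", S2, S3)]

# The re-packing sequence of Fig. 2(b) to (d).
steps = [("J1", S1, S2), ("J2", S1, S2), ("J5", S3, S1), ("J6", S3, S1),
         ("J7", S3, S2)]
results = [shift_job(matrix, job, src, dst) for job, src, dst in steps]
slot3_empty = all(owner is None for owner in matrix[S3])
```

Every shift keeps a job on its original processors, so the sketch only
reorders execution within the scheduling round, as described above.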

When processors are logically organised as a one-dimensional linear array, 
we have two interesting properties which are described below. 

Property 1. Assume that processors are logically organised as a one-dimensional
linear array. Any two adjacent fragments of available processors can be grouped 
together in a single time slot. 

Proof: The claim is trivial when the two fragments are adjacent in the same
slot. We thus assume that the two adjacent fragments are in different time
slots, either sharing a common vertical boundary or partially overlapping with
each other.

Fig. 3. Small fragments combined into a larger one in a single time slot.

Define a cut as a vertical line which is set between two processors and across 
certain (not necessarily consecutive) time slots in the global scheduling matrix 
to cut those slots into two parts. A cut is said to be a legal cut if it does not 
cut existing jobs into two parts, that is, all the existing jobs will have their 
two boundaries of the allocated processors on the same side of the cut. Let us 
introduce two legal cuts through two time slots in a given system. Then all the 
jobs bounded by these two cuts in one slot can legally exchange their positions 
at the same time with their counterpart in the other slot. This is because every 
job bounded between these legal cuts will have all its processes in the bounded 
region and they will still be in the same slot after the exchange. In the following 
discussion all cuts will be considered as legal cuts. 

Because the array is one-dimensional and linear, its left (or right) end forms
a natural boundary which no job can cross. We can thus set our first legal cut
there, as shown in Fig. 3(a). Our second cut is set between the two fragments.
It is also a legal cut because no job can reside on both sides of this cut when
regularity is considered in the processor allocation. Thus the jobs bounded by
these two legal cuts can exchange their positions, which enables the two
fragments to be grouped together in a single slot, as depicted in Fig. 3(b).

In the above example the two small fragments share a common boundary.
When two fragments partially overlap with each other, our second legal cut can
be set anywhere within the overlapped region and then a larger fragment can
be produced. □

The proof is constructive, that is, it describes an algorithm for job
re-packing. In this algorithm we use the left (or right) end of the array as a
legal cut, then set another legal cut between the adjacent fragments and finally
exchange the jobs bounded by the two cuts.
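
The exchange between two legal cuts can be sketched as follows. This is an
illustration under stated assumptions rather than a definitive implementation:
a slot is a list holding a job name per processor (None where idle), job names
are unique within a slot, every job occupies consecutive processors, and a cut
at position c lies between processors c-1 and c (0-based). The names
is_legal_cut and exchange are hypothetical.

```python
from typing import List, Optional

Slot = List[Optional[str]]  # job name per processor, None where idle


def is_legal_cut(slot: Slot, c: int) -> bool:
    """A cut at position c separates processors c-1 and c (0-based).

    The two ends of the array (c == 0 and c == len(slot)) are always legal;
    an inner cut is legal unless one job occupies both neighbouring processors.
    """
    if c <= 0 or c >= len(slot):
        return True
    return slot[c - 1] is None or slot[c - 1] != slot[c]


def exchange(a: Slot, b: Slot, lo: int, hi: int) -> None:
    """Swap everything between cuts lo and hi in slots a and b."""
    for slot in (a, b):
        if not (is_legal_cut(slot, lo) and is_legal_cut(slot, hi)):
            raise ValueError("cut would split a job")
    a[lo:hi], b[lo:hi] = b[lo:hi], a[lo:hi]


# Two adjacent fragments in different slots: P3 in slot a, P1 to P2 in slot b.
a = ["J1", "J1", None, "J2"]
b = [None, None, "J3", "J3"]
exchange(a, b, 0, 2)  # first cut: left end; second cut: between the fragments
```

After the call, slot a holds a single fragment covering P1 to P3, while J1 has
moved to slot b on its original processors.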

We can continue this process in Fig. 3(b) by setting another cut and then 
four original small fragments in the two slots will be reduced to two larger ones 
in two such re-packing steps. 

When more than one cut can be introduced in the middle of the array through
two time slots at the same time, however, the above algorithm is not very
efficient. In Fig. 3, J1 and J3 are swapped twice and thus return to their
original positions. These exchanges are redundant and should be avoided. It is
easy to verify that the following algorithm works more efficiently: First find
all the legal cuts which divide the slots into several regions and then swap the
jobs between the two slots only once, in alternating regions. The proof is
simple and omitted. For the same problem as that in Fig. 3 we can simultaneously
introduce two cuts to divide the array into three regions and then need only
swap the jobs in the middle region to obtain the same result, as depicted in
Fig. 4.
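
One possible reading of this multi-cut algorithm is sketched below. The text
only says that jobs are swapped once in alternating regions; the selection rule
used here (swap a region only when that brings more idle processors into the
target slot) is our assumption, as are the names common_legal_cuts and
repack_pair and the example data, which are not those of Fig. 3.

```python
from typing import List, Optional

Slot = List[Optional[str]]  # job name per processor, None where idle


def common_legal_cuts(a: Slot, b: Slot) -> List[int]:
    """Positions where a cut splits no job in either slot, ends included."""
    def legal(slot: Slot, c: int) -> bool:
        return slot[c - 1] is None or slot[c - 1] != slot[c]

    inner = [c for c in range(1, len(a)) if legal(a, c) and legal(b, c)]
    return [0] + inner + [len(a)]


def repack_pair(a: Slot, b: Slot) -> int:
    """Gather idle processors into slot `a`; return the number of regions swapped.

    Assumed selection rule: a region is swapped only if slot `b` holds more
    idle processors there than slot `a`, so no region is swapped twice.
    """
    cuts = common_legal_cuts(a, b)
    swapped = 0
    for lo, hi in zip(cuts, cuts[1:]):
        if b[lo:hi].count(None) > a[lo:hi].count(None):
            a[lo:hi], b[lo:hi] = b[lo:hi], a[lo:hi]
            swapped += 1
    return swapped


# Hypothetical slots with three small fragments (not the data of Fig. 3).
a = ["J1", None, "J2", "J2", None, "J5"]
b = ["J3", "J3", None, None, "J4", "J4"]
swapped = repack_pair(a, b)
```

Here the common legal cuts split the array into three regions and only the
middle one is swapped. Applying the single-cut procedure twice from the left
end would reach the same layout but would move J1 and J3 twice.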

Fig. 4. A more efficient way for job re-packing between two time slots.



Property 2. Assume that processors are logically organised as a one-dimensional
linear array. If every processor has an idle fragment, jobs in the system can be
re-packed such that all the idle fragments are combined in a single time slot,
which can then be eliminated.

Proof: This is a simple extension of Property 1. We have already given
an example, as depicted in Fig. 2. We can easily prove that the property holds
in the general case by the following simple induction.

Assume that the first k small fragments on the first k processors have been
combined into a single fragment of size k in time slot S_i and that the
fragment on processor P_{k+1} is in time slot S_j. Setting a cut at one end of
the array and a cut between the two processors P_k and P_{k+1}, we can then
combine the two fragments into one of size k+1 in either S_i or S_j according
to Property 1.

□
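
The induction translates directly into a procedure. The sketch below is a
minimal illustration, assuming the list-of-slots representation (a job name per
processor, None where idle), contiguous jobs, and the left end of the array as
the natural cut; the name eliminate_slot is hypothetical. The example data are
those of Fig. 2(a) after J' and J" have terminated.

```python
from typing import List, Optional

Slot = List[Optional[str]]  # job name per processor, None where idle


def eliminate_slot(matrix: List[Slot]) -> Optional[int]:
    """Re-pack so that one slot becomes completely idle and return its index.

    Returns None, leaving the matrix untouched, when some processor is busy in
    every slot.
    """
    n = len(matrix[0])
    if any(all(row[p] is not None for row in matrix) for p in range(n)):
        return None
    # `cur` is the slot whose first k processors are all idle (fragment of size k).
    cur = next(i for i, row in enumerate(matrix) if row[0] is None)
    for k in range(1, n):
        if matrix[cur][k] is None:
            continue  # the fragment already extends to processor k
        # Processor k is idle in some other slot j. Cuts at the left end and
        # between processors k-1 and k are legal in both slots, so the first k
        # columns can be exchanged, which moves the fragment into slot j.
        j = next(i for i, row in enumerate(matrix) if row[k] is None)
        matrix[cur][:k], matrix[j][:k] = matrix[j][:k], matrix[cur][:k]
        cur = j
    return cur


matrix = [
    ["J1", "J1", "J2", "J2", None, "J3", "J3", "J3"],  # S1
    [None, None, None, None, "J4", "J4", None, None],  # S2
    ["J5", "J5", "J5", "J6", "J6", None, "J7", "J7"],  # S3
]
empty = eliminate_slot(matrix)
```

The sketch empties S2 rather than S3, but the two remaining slots hold the same
two groups of jobs as in Fig. 2(d). It makes no attempt to avoid the redundant
exchanges discussed before Property 2.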

In the above discussion we assumed that processors are organised as a
one-dimensional linear array. To merge two adjacent fragments, only one cut
actually has to be found, because the boundaries of the array can be utilised
as natural legal cuts. For a one-dimensional ring, in which two fragments at
the two ends of a linear array can be combined into one in the same time slot,
the situation becomes a bit more complicated. Because there are no natural
boundaries like those in a linear array, we have to find two legal cuts to
merge two adjacent fragments in different slots. The first one can be
considered as a cut used to break the ring, after which the properties
discussed above for a one-dimensional array can be applied.

Note that the re-packing may increase the scheduling overhead on a clustered 
parallel system because a message notifying the changes in the global scheduling 
matrix should be broadcast to processors so that the local scheduling tables 
on each processor can be modified accordingly. However, there is no need to 
frequently re-pack jobs between slots. The re-packing is applied only when the 
working condition is changed, e.g., when a job is terminated, or when a new job 
arrives. In these cases certain local scheduling tables need to be updated even 
without job re-packing. Thus the extra system cost introduced by the re-packing 
may not be high. In the next section we shall see that, when job re-packing is 
allowed, the procedure for searching available resources can be simplified and 
then the overall system overhead for resource allocation may even be reduced. 

4 Resource Allocation Based on Job Re-packing 

Conventionally the procedure for allocating processors to a new job is first to 
search if there is a suitable subset of available processors in an existing time 
slot. The job will then be allocated if such a suitable subset can be found. The 
purpose of this search is to try to avoid creating a new time slot which is not 
necessary. Because the search is done slot by slot and only local information on 
each time slot is considered, however, the results can often be far from optimal. 
This search procedure may also become expensive for large systems. 

Consider the simple example depicted in Fig. 5. In the figure there are
currently four time slots in a scheduling round. Assume that a new job arrives
which requires five processors. Using the conventional method, the system first
searches for a subset of at least five consecutive idle processors in an
existing time slot. In this particular example no such subset of available
processors can be found, so a new time slot has to be created. If job
re-packing is allowed, however, it is easy to see that the new job can be
allocated to processors P3 to P7 (or P4 to P8) in time slot S2 by simply
shifting job J4 to slot S3. Thus the search effort is totally wasted and the
creation of a new time slot is also unnecessary.

Based on job re-packing we can obtain a new scheme for resource allocation.
This new scheme can significantly simplify the search procedure and make better
decisions for resource allocation.

Fig. 5. Resource allocation using a workload vector WLV.



In addition to the global scheduling matrix, we introduce a workload vector
(WLV) of length equal to the number of processors in the system, as depicted
in Fig. 5. An integer value is assigned to each entry to record the number of
idle slots on the associated processor. For example, the entry corresponding to
processor P1 is given the value 0 because there is no idle slot on that
processor, while the last entry of the vector is equal to 3, denoting that
there are currently three idle slots on processor P8.

For the conventional allocation method, adding this workload vector may not
help in making resource allocation decisions. This is because the information
contained in the vector does not tell which slot is idle on a processor,
whereas processes of the same job have to be allocated in the same time slot.
With job re-packing, however, we know that on a one-dimensional linear array
any two adjacent fragments of available processors can be grouped together into
a single time slot according to Property 1 discussed in the previous section.
To search for a suitable subset of available processors, therefore, we only
need to count consecutive nonzero entries in the workload vector if job
re-packing is allowed. Thus the problem of searching the entire two-dimensional
scheduling matrix for a suitable subset of available processors becomes a
simple one-dimensional search problem. It is easy to see in Fig. 5 that,
because there are six consecutive nonzeros in the workload vector, the new job
which requires five processors can be allocated, with a simple job re-packing
process, without creating a new time slot. Therefore, the problems caused by
the conventional method can be alleviated significantly.
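
The workload vector and the one-dimensional search can be sketched in a few
lines. The representation (a list of slots holding a job name per processor,
None where idle) and the names workload_vector and find_run are assumptions
made for illustration. The matrix is a hypothetical one in the spirit of
Fig. 5, not its actual data, and job sizes are assumed to be at least 1.

```python
from typing import List, Optional

Slot = List[Optional[str]]  # job name per processor, None where idle


def workload_vector(matrix: List[Slot]) -> List[int]:
    """Number of idle time slots on each processor."""
    return [sum(row[p] is None for row in matrix) for p in range(len(matrix[0]))]


def find_run(wlv: List[int], size: int) -> Optional[int]:
    """Index of the first processor of `size` consecutive nonzero entries, or None."""
    run = 0
    for p, idle in enumerate(wlv):
        run = run + 1 if idle > 0 else 0
        if run >= size:
            return p - size + 1
    return None


# A hypothetical four-slot matrix in the spirit of Fig. 5 (not its actual data).
matrix = [
    ["J1", "J1", "J1", "J1", "J2", "J2", "J2", "J2"],  # S1
    ["J3", "J3", None, None, "J4", "J4", None, None],  # S2
    ["J5", "J5", "J5", "J5", None, None, "J6", None],  # S3
    ["J7", "J7", "J8", "J8", "J8", "J9", "J9", None],  # S4
]
wlv = workload_vector(matrix)
start = find_run(wlv, 5)
```

No single slot of this matrix has five consecutive idle processors, yet the
vector has six consecutive nonzero entries, so a job of size five can be placed
from the third processor onwards once the fragments are re-packed.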

Our new scheme for resource allocation consists of three main steps. First
we search the workload vector WLV for the required number of consecutive
nonzeros. A new time slot is created only if the required number of consecutive
nonzeros cannot be found in that vector. This step determines a suitable subset
of consecutive processors to be allocated to a new job. Then we trace adjacent
idle fragments just within this subset of processors and group them into a
single time slot through proper re-packing procedures such as those discussed
in the previous section. This determines in which time slot the new job
resides. Finally we update the scheduling matrix and also the local scheduling
tables on each processor, if there are any. Using this scheme the search for a
suitable subset of available processors is simplified and the total number of
time slots in a scheduling round can be kept low. Thus system performance may
be enhanced.
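
The three steps can be put together in one hedged sketch. It assumes the
list-of-slots representation (a job name per processor, None where idle) and
uses the exchange between the left end of the array and a cut, taken from the
proof of Property 1, as the re-packing procedure; other procedures are
possible. The name allocate and the example matrix are hypothetical, and the
update of local scheduling tables is not modelled.

```python
from typing import List, Optional, Tuple

Slot = List[Optional[str]]  # job name per processor, None where idle


def allocate(matrix: List[Slot], job: str, size: int) -> Tuple[int, int]:
    """Place `job` on `size` consecutive processors; return (slot, first processor)."""
    n = len(matrix[0])
    if not 1 <= size <= n:
        raise ValueError("job size must be between 1 and the number of processors")

    # Step 1: one-dimensional search of the workload vector.
    wlv = [sum(row[p] is None for row in matrix) for p in range(n)]
    start, run = None, 0
    for p, idle in enumerate(wlv):
        run = run + 1 if idle > 0 else 0
        if run >= size:
            start = p - size + 1
            break

    if start is None:
        # No suitable run exists, so a new time slot is created.
        matrix.append([None] * n)
        slot, start = len(matrix) - 1, 0
    else:
        # Step 2: grow an idle fragment rightwards from `start`, moving it to
        # another slot (exchange between the left end and cut k) when blocked.
        slot = next(i for i, row in enumerate(matrix) if row[start] is None)
        for k in range(start + 1, start + size):
            if matrix[slot][k] is not None:
                j = next(i for i, row in enumerate(matrix) if row[k] is None)
                matrix[slot][:k], matrix[j][:k] = matrix[j][:k], matrix[slot][:k]
                slot = j

    # Step 3: record the allocation in the global scheduling matrix.
    matrix[slot][start:start + size] = [job] * size
    return slot, start


# Hypothetical four-slot matrix (index 0 is S1); not the data of Fig. 5.
matrix = [
    ["J1", "J1", "J1", "J1", "J2", "J2", "J2", "J2"],
    ["J3", "J3", None, None, "J4", "J4", None, None],
    ["J5", "J5", "J5", "J5", None, None, "J6", None],
    ["J7", "J7", "J8", "J8", "J8", "J9", "J9", None],
]
first = allocate(matrix, "Jnew", 5)   # fits after re-packing
second = allocate(matrix, "Jbig", 8)  # no run of eight, so a slot is added
```

For this matrix the net effect of step 2 is that J4 ends up in the third slot,
which frees five consecutive processors in the second slot for the new job.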

To ensure high system and job performance it is very important to balance
workloads across the processors. Another advantage of our allocation scheme is
that it is able to handle the problem of load balancing. Because the workload on
each processor is recorded in the workload vector, the system can easily choose
a subset of less active processors for an incoming job if there are several
suitable subsets. In Fig. 5 there are two subsets of available processors
suitable for a new job requiring five processors, that is, one from P3 to P7
and the other from P4 to P8. When load balancing is taken into consideration,
the second subset is preferable. It can be seen in the above example that the
system can still allocate resources to a new job which requires six processors
without creating a new time slot if the second subset is chosen.
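
A simple way to pick among several suitable subsets is sketched below. The
measure of "less active" used here, the largest total number of idle slots over
the window, is our assumption, as are the name least_loaded_window and the
vector values, which are not read off Fig. 5.

```python
from typing import List, Optional


def least_loaded_window(wlv: List[int], size: int) -> Optional[int]:
    """First processor of the suitable window with the most idle slots, or None."""
    best_start, best_idle = None, -1
    for start in range(len(wlv) - size + 1):
        window = wlv[start:start + size]
        if all(v > 0 for v in window) and sum(window) > best_idle:
            best_start, best_idle = start, sum(window)
    return best_start


wlv = [0, 0, 1, 1, 1, 1, 1, 3]  # hypothetical values, not read off Fig. 5
choice = least_loaded_window(wlv, 5)
```

With these values the window starting at the fourth processor is chosen over
the one starting at the third, because it includes the most lightly loaded
processor at the right end.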

With job re-packing the buddy based algorithm can be implemented in a
more efficient way. Assume that a job of size p arrives, where n/2 < p ≤ n and
n is a power of 2. To find a suitable subset of available processors in an
existing time slot, we need only divide the workload vector into m groups of
size n, where m = N/n and N is the number of processors in the system, and
check whether there is a group in which all entry values are nonzero. If there
are several available groups, we can choose the least loaded one by simply
checking the entry values. Thus there is no need to scan all the existing time
slots to find a suitable subset, and the workloads can be balanced more easily.
Since there is a natural boundary between each pair of groups which no job can
cross, re-packing jobs in the selected group will not affect the local
scheduling tables of processors outside the group.
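
The group check can be sketched as follows, assuming that N is a power of 2
and that the least loaded group is the one with the largest total number of
idle slots (our interpretation). The name buddy_group and the vector values are
hypothetical.

```python
from typing import List, Optional


def buddy_group(wlv: List[int], p: int) -> Optional[int]:
    """First processor of the least loaded free buddy group for a job of size p."""
    n = 1
    while n < p:          # smallest power of two with n >= p
        n *= 2
    if n > len(wlv):
        return None
    best_start, best_idle = None, 0
    for start in range(0, len(wlv), n):      # the m = N/n groups
        group = wlv[start:start + n]
        if all(v > 0 for v in group) and sum(group) > best_idle:
            best_start, best_idle = start, sum(group)
    return best_start


wlv = [0, 0, 1, 1, 1, 1, 1, 3]  # hypothetical workload vector for N = 8
group_for_3 = buddy_group(wlv, 3)   # n = 4: only the right half is free
group_for_2 = buddy_group(wlv, 2)   # n = 2: three free pairs, last is lightest
```

Only N/n groups are inspected, and the time slots themselves are never scanned.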

Fig. 6. The workload vector (WLV) and an additional binary tree (WLT) for
recording the average group workload used for a buddy based resource allocation
system.

The procedure for detecting a suitable subset of available processors can be
simplified further for the buddy based resource allocation system by introducing
an additional binary tree structure to record the average group workload, as
depicted in Fig. 6. The tree has log N levels, where N is the number of
processors in the system. The node at the top level is associated with all N
processors. The N processors are divided into two subsets of equal size and
each subset is then associated with a child node of the root. The division and
association continues until the bottom level is reached. Each node of the tree
is assigned a value according to the values of its two children. It is simply
the sum of the two values of the children when both values are nonzero.
Otherwise, it is set to zero. When the value is set to zero, the associated
subset of processors is not available for new arrivals under the current
situation.
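
The construction of the tree from the workload vector can be sketched as
below. We assume that N is a power of 2 and that the children of the
bottom-level nodes are pairs of entries of the workload vector, which gives the
log N levels mentioned above. The name build_wlt and the vector values are
hypothetical.

```python
from typing import List


def build_wlt(wlv: List[int]) -> List[List[int]]:
    """Return the tree levels from the bottom level up to the root."""
    levels = []
    below = wlv
    while len(below) > 1:
        below = [left + right if left and right else 0
                 for left, right in zip(below[0::2], below[1::2])]
        levels.append(below)
    return levels


# Hypothetical workload vector for N = 8; processors P1 and P2 have no idle slot.
levels = build_wlt([0, 0, 1, 1, 1, 1, 1, 1])
bottom, middle, root = levels
```

A single zero entry propagates upwards: here the leftmost bottom node, the left
middle node and the root are all zero, while the right half of the machine
remains available.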

This binary tree is simple to manage and can greatly assist decision
making in resource allocation. We can let an existing job on a subset of
processors run in multiple time slots when the value of the associated node is
nonzero. It is easy to see in Fig. 6 that jobs J2 and Jq can run in multiple
time slots because the values of the two nodes on the right at the bottom level
are nonzero. However, the value of the right node at the middle level is also
nonzero. We are able to run job J4 in multiple time slots by shifting Jg down
to time slot S1 (or J2 up to S3).

Assume that job J1 in Fig. 6 is terminated. The entry values associated
with processors P1 and P2 in the workload vector become nonzero. Then the
values of the two leftmost nodes at the bottom level of the binary tree will
also be nonzero. This will cause the value of the left node at the middle level
to become nonzero, and so too the root value. Since the top node is associated
with all processors in the system, we know that there is at least one idle slot
on each processor. According to Property 2 discussed in the previous section,
these idle fragments can be combined in a single time slot, which can then be
eliminated. Using this binary tree, therefore, we are able to know quickly when
a time slot can be deleted by simply checking whether the value of the root
node is nonzero.

We can also quickly find a suitable subset of available processors for a new
arrival simply by reading the values of the nodes at the proper level. Consider
the situation depicted in Fig. 6 again and assume that a new job of size 4
arrives. In this case we need only check the two nodes at the middle level.
Since the value of the second node is nonzero, we know that the second subset
of processors is available. We may then re-pack job Jg into time slot S1 and
allocate four processors from P5 to P8 in time slot S3 to the new arrival.
Finally the values of the associated node and its children are updated. (In
this particular case they are all set to zero.)
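
The lookup and the subsequent update can be sketched as follows. The tree is
stored as a list of levels from the bottom level (nodes covering two
processors) up to the root. This layout, the names find_subset and refresh, and
the example values are assumptions made for illustration.

```python
from typing import List, Optional


def find_subset(wlv: List[int], levels: List[List[int]], size: int) -> Optional[int]:
    """First processor of a free buddy subset for a job of `size`, or None.

    `levels` lists the tree from the bottom level (nodes covering two
    processors) up to the root.
    """
    n, depth = 1, -1
    while n < size:       # subset size n = smallest power of two >= size
        n *= 2
        depth += 1
    if n > len(wlv):
        return None
    values = wlv if depth < 0 else levels[depth]
    for i, value in enumerate(values):
        if value:
            return i * n
    return None


def refresh(wlv: List[int], levels: List[List[int]], p: int) -> None:
    """Recompute the ancestors of processor p after wlv[p] has changed."""
    below, i = wlv, p
    for level in levels:
        i //= 2
        left, right = below[2 * i], below[2 * i + 1]
        level[i] = left + right if left and right else 0
        below = level


# Hypothetical state for N = 8 (P1 and P2 busy in every slot).
wlv = [0, 0, 1, 1, 1, 1, 1, 1]
levels = [[0, 2, 2, 2], [0, 4], [0]]

start = find_subset(wlv, levels, 4)      # reads only the two middle-level nodes
for p in range(start, start + 4):        # the new job takes one idle slot on each
    wlv[p] -= 1
    refresh(wlv, levels, p)
```

In this example the right middle-level node and both of its children drop to
zero after the allocation. Checking whether a slot can be eliminated amounts to
testing whether levels[-1][0] is nonzero.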

In the buddy based algorithm processors are repeatedly divided into two
equal subsets and processes of the same job are allocated only in a selected
subset. As we mentioned previously, with this arrangement the number of
unification counts can be increased and jobs may have better chances to run in
multiple time slots to enhance system and job performance. However, simply
running jobs in multiple time slots may not be desirable when fairness is taken
into consideration. In the above example jobs J2 and Jg can run in both time
slots S1 and S3 before the new job arrives. Assume that no existing jobs are
completed when the new job arrives. A new time slot then has to be created and
the performance of those jobs running in a single time slot will be degraded.

Another potential problem associated with simply running jobs in multiple 
time slots is that the total number of time slots may become large when the 
system is busy. The system overhead will be increased when trying to manage 
a large number of time slots. Therefore, a better way to achieve a high system 
and job performance is to combine the procedures of slot unification and slot 
minimisation together. This combination can easily be implemented with simple 
modifications to our allocation technique discussed in this section. 

5 Conclusions 

In this paper we presented some ideas for resource allocation to enhance the 
performance of gang scheduling. 

We introduced a job re-packing scheme. In this scheme we try to rearrange 
the order of job execution on their originally allocated processors in a scheduling 
round to combine small fragments of available processors into a larger and more 
useful one. We presented two interesting properties for re-packing jobs on a 
parallel system which is logically organised as a one-dimensional linear array. 
These two properties indicate that job re-packing is simple, the system costs 
may not be high and thus the scheme can be practical. 

Based on job re-packing we developed an efficient resource allocation scheme.
When processors are logically organised as a one-dimensional linear array, any
adjacent fragments can be grouped together to become a larger one in a single
time slot. Thus we can use a workload vector, which records the number of idle
slots on each processor, to detect a suitable subset of available processors
for a given job. The problem of searching for available processors on a
two-dimensional global scheduling matrix then becomes a simple one-dimensional
search problem. Because the scheme considers workload conditions in both the
space and time dimensions simultaneously, it is possible that the average
number of slots per scheduling round can be kept low and workloads can also be
well balanced across the processors. Therefore, the resources in the system may
be utilised more efficiently and the performance of parallel jobs may also be
enhanced significantly.

There are many interesting and open problems relating to job re-packing. 
In this paper we only discussed how to re-pack jobs between rows in the global 
scheduling matrix. Therefore, an optimal solution in minimising the average 
number of time slots per scheduling round is achievable only when process or 
thread migration is not allowed. When processes can also be moved between 
columns in the global scheduling matrix, situations will become much more com- 
plicated because we must seriously consider the system overhead. 

We only considered in the paper a simple system configuration, that is, pro- 
cessors in the system are logically organised as a one-dimensional linear array. 
The allocation schemes may work well for clusters of workstations/PCs/SMPs. 
However, there are parallel systems in which processor localities have to be con- 
sidered in order to reduce the communication cost and to alleviate the problem 
of communication contention. An interesting problem is thus how to effectively 
re-pack jobs on other system configurations. 

With job re-packing we are able to combine the procedures of slot unification 
and slot minimisation together. An interesting problem is how to determine when 
slot unification should be applied and when the total number of slots needs to 
be reduced such that the overall system and job performance can be enhanced. 

Job re-packing may introduce extra system costs because of the extra require- 
ment for updating the local scheduling tables on a distributed parallel system. 
On the other hand job re-packing may reduce the system overhead because the 
procedure for finding a suitable subset of available processors becomes cheaper 
especially for a buddy based resource allocation system. Extensive testing on 
real parallel machines is required to measure the actual system overhead and to 
try to find effective methods to further minimise the system costs. 

Experiments based on the ideas described in this paper are to be undertaken
on distributed-memory parallel machines, the Fujitsu AP1000+ and AP3000, at
the Australian National University.


Abstract. Job management subsystems in parallel environments have
to address two important issues: (i) how to associate processes present in 
the system to the tasks of parallel jobs, and (ii) how to control execution 
of these tasks. The standard UNIX mechanism for job control, process 
groups, is not appropriate for this purpose as processes can escape their 
original groups and start new ones. We introduce the concept of geneal- 
ogy, in which a process is identified by the genetic footprint it inherits 
from its parent. With this concept, tasks are defined by sets of processes 
with a common ancestor. Process tracking is the mechanism by which
we implement the genealogy concept in the IBM AIX operating system.
No changes to the kernel are necessary and individual process control is
achieved through standard UNIX signaling methods. Performance evaluation,
on both uniprocessor and multiprocessor systems, demonstrates
the efficacy of job control through process tracking. Process tracking has
been incorporated in a research prototype gang-scheduling system for
the IBM RS/6000 SP.

1 Introduction 

The job management subsystem of a computing environment is responsible for 
all aspects of controlling job execution. This includes starting and terminating 
jobs as well as the details related to their scheduling. Parallel jobs executing
on a distributed or clustered system typically comprise a set of concurrently
executing tasks that collaborate with each other. Traditional parallel jobs in the
scientific computing community (e.g., MPI and HPF programs) consist of a fixed
set of tasks, each comprising a single process. However, we are observing that
on the RS/6000 SP users are beginning to write their parallel programs with
each task consisting of a set of processes, exemplified by the following situation:
A csh script fires two collaborating perl scripts, connected by a pipe. Each of
these perl scripts then executes a few different Fortran or C++ programs to
perform a computation. In this situation, a task is no longer a single process
but a dynamically changing tree of processes. In fact, in the more general case 
a task can be a forest of processes. Figure 1 shows a parallel job consisting of 
three tasks. Each task in turn consists of a set of processes. 

Fig. 1. A 3-way parallel job, each task a forest of processes.

Many different approaches to controlling the execution of tasks are described
in the literature [4]. In such systems, schedulers are typically organized in a
two-tier structure, with a central global scheduler for the system and a
node-level scheduler (NLS) in each node. The central scheduler is responsible
for resource allocation and job distribution. The node-level scheduler is
responsible for actually controlling the execution of the individual tasks of a
job on its corresponding node. This paper focuses on the issues related to
local task control, as performed by the NLS.

Depending on the level of inter-NLS synchronization, different scheduling 
variations are possible. Explicit gang-scheduling systems always run all the tasks 
of a job simultaneously [5,6,7,8,9,10,17,21]. In contrast, communication-driven 
systems [3,19,20] are more loosely coupled, and schedule tasks based on message 
arrival. Independent of the inter-task scheduling approach, we address those 
cases in which all processes within a task are closely coupled, like the csh example 
of the first paragraph. In these situations, all processes of a task must be enabled 
for execution to allow effective progress within the task. 

Typical UNIX operating systems (OS) have their own process scheduling
semantics, which aim at improving interactive response times and are based on
process priorities. The OS schedulers operate at very fine granularity (on the
order of 10 ms) and offer very limited outside control. These schedulers
are not geared to the task-centric scheduling required in parallel job control.
The traditional way to work around the lack of support for external control in
OS schedulers is to dedicate an entire node (OS image) to the execution of a
single task within a parallel job [11,18,13]. A more flexible approach would
permit multiple tasks to share a node, both in space and time.

Our goal is to develop a task control system that can give each task the
illusion that it is running on its own virtual machine. The virtual machines are
created by slicing a physical machine along the time axis. Within each virtual
machine, all processes of a task are controlled and executed by the OS scheduler.
The time-slices for these virtual machines have to be large enough to amortize
the cost of context switching. Production parallel applications are typically
memory hungry, and it can take quite a while for them to bring their working
data in from paging space. In fact, simple time-sharing (through the OS
scheduler) of parallel jobs can easily lead to thrashing of the system. While
it is active, each task must also have, for performance and security reasons,
dedicated access to the parallel machine’s communication device.
Context-switching of these devices is also an expensive operation, as reported
in [7]. For the above reasons, the granularity of the task scheduler is much
larger (on the order of seconds or even minutes) than that of the underlying
OS. This large granularity can be tolerated in our environment as we target
noninteractive applications.

To effectively implement this virtual machine concept, the node-level scheduler 
must be able to control tasks as single entities. All the processes in a task 
must be suspended or resumed together on each context switch. There cannot be 
“stray” processes that either are not suspended when the task is suspended or 
not resumed when the task is resumed. This behavior must be enforced so that 
we can guarantee there is no interference between suspended and running tasks. 
In addition, the time interval it takes to suspend or resume all the processes in 
a task must be small compared to the time-slice for the task. Finally, we must 
have a mechanism that precisely defines the set of processes that belong to a 
particular task of a parallel job. To that purpose, we introduce the concept of 
genealogy of processes. 

Each task of a parallel job starts as a single process, which defines the root 
of that task. This root process can create new processes, which are its direct 
descendants and belong to the same task. The process creation can continue 
recursively and all descendants of the original root of the task belong to the same 
task. Belonging to a task is a genetic property of a process. It cannot be changed 
and it does not depend on which of its ancestors are alive at any given time. 
The genetic footprint of a process defines which task it belongs to. Genealogy 
is a good way to define a task for the following reasons. First, resources must 
be allocated and accounted for during the execution of a job. These resources 
include, among others, memory, disk space, and processor time. Any process 
associated with that job should be considered as using those resources. Second, it 
defines a scheduling unit that can be controlled through various priority schemes. 
Even though the set of processes comprising a task is dynamically changing, the 
binding of processes to tasks is established at process creation time and remains 
fixed until process termination. 

We define five mechanisms that are necessary for implementing the genealogy 
concept in a job management system. For future reference, we name these 
mechanisms M1 through M5: 

M1: a mechanism to create new tasks, 

M2: a mechanism to associate processes to tasks, 

M3: a mechanism to terminate all processes in a task, 

M4: a mechanism to capture all dynamic process creation and termination 
in order to establish the genetic footprint, and 

M5: a mechanism to prevent any “escape hatches” that allow processes 
to leave a task. 

These are in addition to the basic functionality of suspending and resuming 
execution of a task, necessary for the actual scheduling operation. 

In this paper we describe the difficulties that we encountered in implement- 
ing the genealogy concept for task control, and what solutions we adopted. In 
Section 2 we explain why the existing UNIX concepts for process control are 
not appropriate for the kind of task control we desire. In Section 3 we discuss 
the particulars of our implementations and in Section 4 we present some ex- 
perimental results for our task control mechanism on a single node. Section 5 
discusses the integration of process tracking on an actual job scheduling system. 
Our conclusions are presented in Section 6. 



2 Existing UNIX Mechanisms 

In standard UNIX systems, processes are the units of scheduling. In order to 
perform job based scheduling one has to use process set mechanisms. Modern 
UNIX systems, such as SVR4 and 4.4BSD provide two standard notions of pro- 
cess sets. The first one is process group and the second one is session [22]. Process 
groups are provided to identify a set of related processes that should receive a 
common signal for certain events. Process groups are the standard UNIX mech- 
anism for implementing job control. Sessions are provided to identify a set of 
related processes that have a common controlling terminal (i.e., are part of one 
login session). One session can have several process groups. 

A process group is defined by its group leader. This is the process which 
initially creates a new group through a call to setpgid. Default process groups 
are formed by the parent-child relationship that is established when a process 
forks itself to create another process. The child process is created in the same 
process group as its parent. Execution of all processes in a process group can 
be suspended by sending a signal(SIGSTOP) to that group. Correspondingly, 
execution can resume by sending a signal(SIGCONT) to that group. At first 
glance this seems to implement the genealogy concept previously introduced. 
However, a UNIX process can switch to a different process group or start its 
own. This constitutes an escape from its original group, thus failing to implement 
mechanism M5 presented above. Also, if a process group leader terminates, all 
the remaining processes in that group are reassigned to a built-in process group 
0. This in turn fails mechanism M2, since we can no longer associate processes 
with their task. 
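
The escape described above can be illustrated with a short sketch. It is not part of the system described in this paper; it only assumes a POSIX system and Python's standard os module, and the function name is hypothetical. A child process starts its own process group, after which a signal sent to the parent's group would no longer reach it.

```python
import os


def child_escapes_process_group() -> bool:
    """Fork a child that starts its own process group; report whether it left ours."""
    parent_pgid = os.getpgid(0)
    ready_r, ready_w = os.pipe()   # child -> parent: "I have switched groups"
    done_r, done_w = os.pipe()     # parent -> child: "you may exit now"
    pid = os.fork()
    if pid == 0:
        # Child: become the leader of a brand-new process group.
        os.close(ready_r)
        os.close(done_w)
        os.setpgid(0, 0)
        os.write(ready_w, b"x")
        os.read(done_r, 1)         # stay alive until the parent has looked
        os._exit(0)
    # Parent: wait for the switch, then inspect the child's group.
    os.close(ready_w)
    os.close(done_r)
    os.read(ready_r, 1)
    child_pgid = os.getpgid(pid)
    os.write(done_w, b"x")
    os.waitpid(pid, 0)
    os.close(ready_r)
    os.close(done_w)
    return child_pgid != parent_pgid
```

When the function returns True, a SIGSTOP delivered to the parent's process group would have missed the child, which is the failure of mechanism M5 discussed in the text.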

Similar to what happens with process groups, a session is defined by its 
session leader. This is the process which initially creates a new session through a 
call to setsid. Again, default sessions are formed by the parent-child relationship: 
children initially inherit the session from their parent. Processes can start new 
sessions, thus disconnecting from their original session and creating the same 
problems as described for process groups. 

In addition to these two explicit ways to define process sets in UNIX (i.e., 
groups and sessions), there is an implicit way through the transitive parent-child 
relationship. UNIX provides mechanisms to obtain snapshots (samples) of the 
list of processes executing at a given time. (Examples of these mechanisms are 
the getprocs function and the ps command.) For each process this list reports its 
parent. By properly processing this list we can build a set that has its origin on 
a particular root process. However, this fails to establish the proper genealogy of 
processes for the following reasons. First, it is not an atomic operation: processes 
can be created and terminated while the list is being examined. Second, if the 
process list is not sampled at least once between the events of a child being 
created and its parent terminating, the genetic property of the child is lost. The 
very first time we come to know about the child process will show init as its 
parent. Finally, since process identifiers are reused in UNIX, one might associate 
a process with the wrong task if sampling misses this reuse between an old 
and a new process. In summary, the parent-child mechanism of UNIX does not 
properly implement the genealogy concept as required for proper task control. 
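
The second failure mode can be made concrete with a small sketch. The snapshot format (a mapping from pid to parent pid) and the function name are assumptions chosen for illustration; real process listings carry more fields.

```python
def descendants_from_snapshot(snapshot, root_pid):
    """Collect root_pid and everything reachable from it through parent links.

    snapshot maps pid -> parent pid, as reported by one process listing.
    """
    children = {}
    for pid, ppid in snapshot.items():
        children.setdefault(ppid, []).append(pid)
    found, stack = set(), [root_pid]
    while stack:
        pid = stack.pop()
        if pid in found:
            continue
        found.add(pid)
        stack.extend(children.get(pid, []))
    return found


# First sample: root process 100 has one child, 101.
before = {1: 0, 100: 1, 101: 100}
# Between samples, 101 creates 102 and terminates; 102 now shows init (pid 1) as parent.
after = {1: 0, 100: 1, 102: 1}
```

With these two samples, process 102 never appears in the set built from root 100, although it is a genetic descendant of that root.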

Some systems have attempted to overcome the shortcomings of the UNIX 
parent-child model by using a special library that intercepts the fork() and 
sigaction() system calls [12]. This approach allows one to record every event 
of process creation and termination and maintain information on the geneal- 
ogy of processes. However, this is an incomplete solution, as it depends on the 
collaboration of user applications in linking to those libraries. 

The concept of task control has been successfully implemented in other oper- 
ating systems. In the OS/390 operating system, the enclave is a concept similar 
to our task [1]. When a request (e.g., a web request) enters the system, and a 
thread is started in some process to service this request, a new enclave is created. 
The enclave essentially represents a unit of work being performed on behalf of 
a request. If this first thread initiates new threads in different processes (e.g., 
the web request leads to a database operation), then these new threads are also 
added to the enclave. Accounting of resource consumption is done on an enclave 
basis. Furthermore, all threads of an enclave can be scheduled concurrently. 



3 Implementing Process Tracking 



After examining the options offered by UNIX, we did not find any functional- 
ity that satisfactorily implements the genealogy concept. It was not an option 
for us to modify the existing commercial AIX operating system to introduce a 
new construct. The solution we adopted was the development of a process track- 
ing kernel extension. A kernel extension is a mechanism to dynamically load 
additional code into the kernel space, thus extending the kernel with new func- 
tionality that can be accessed by user-level programs. These extensions execute 
in the kernel mode of the calling process and have access to all kernel related 
activities. The purpose of our process tracking kernel extension is to monitor 
and log all process creation and termination events. Based on that information, 
the kernel extension maintains the genealogy of selected process sets. The kernel 
extension also implements the task control functions necessary for scheduling: 
task suspend, task resume, and task kill. 

Our process tracking kernel extension makes use of one particular AIX kernel 
feature, the process state change handler. As the name indicates, process state 
change handlers in AIX are invoked every time a process state changes. Several 
process state change handlers can be chained. In particular, the following events 
trigger a call to the handlers: process creation, process termination, thread 
creation, and thread termination. Consequently, process state change handlers have 
strict performance requirements and must be as unintrusive as possible. In 
our case that implies ignoring events unrelated to tracked processes and 
maintaining the genealogy of tracked processes efficiently. 

The operation of the process tracking is illustrated in Figure 2. The kernel 
extension maintains the following data structures: 

— TrackedProcessObj: An object of this type is maintained for each process 
being tracked. It includes the process identifiers (pids) of the process and its 
genetic parent. It also contains pointers to the TrackedProcessObjs for one 
sibling process and one child process. (The sibling pointer is used to form a 
list of children.) This object is necessary because the kernel process object, 
the uproc structure, cannot be modified directly. 

— TaskObj: An object of this type is maintained for each task. It contains a 
task identifier that is assigned by the node level scheduler. It also maintains 
pointers to the TrackedProcessObjs of the top level processes belonging to 
the task. The collection of all TaskObjs constitutes the list of tasks to be 
monitored in the system. 

The process state change handler is loaded once in the kernel extension and 
immediately starts monitoring process creation and termination events. Because 
the task list is initially empty, none of these events has any effects. When the 
node-level scheduler creates a new task, it registers that task and its root process 
with the kernel extension. This involves the creation of a new TaskObj and a 
new TrackedProcessObj. From this point on the process is being tracked. This 
implements mechanism M1. Mechanisms M2, M4, and M5 are implemented by 
monitoring all process creation and termination events. 

When a new process is created, its parent’s pid is searched in the set of 
tracked processes. To perform an efficient search we provide a hashtable access 
based on the pid. If the parent is a tracked process, and hence belongs to some 
task, we create a new TrackedProcessObj, add it to the children list of the parent 
process, and create a new entry in the hashtable. 

When a tracked process terminates, we remove its associated TrackedPro- 
cessObj from its parent’s children list. Then its own children are added to its 
parent’s children list. (The TaskObj acts as a parent to the top level processes.) 
This is very different from the way the kernel maintains the parent-child rela- 
tionship. Whereas the kernel makes init adopt orphan processes, we implement 
a policy in which orphans are adopted by their most immediate ancestor still 
alive. (As an alternative policy, orphans could always be adopted by the cor- 
responding TaskObj.) This approach always keeps a process associated with its 
original task. 
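
The bookkeeping described above can be sketched in user-level Python. This is an illustration only, not the kernel extension itself: the class and method names are hypothetical, and Python lists stand in for the sibling and child pointers of the TrackedProcessObj.

```python
class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.children = []            # top-level processes of the task


class TrackedProcess:
    def __init__(self, pid, parent):
        self.pid = pid
        self.parent = parent          # genetic parent: a TrackedProcess or the Task
        self.children = []


class Tracker:
    def __init__(self):
        self.by_pid = {}              # hash table: pid -> TrackedProcess

    def _add(self, pid, parent):
        proc = TrackedProcess(pid, parent)
        parent.children.append(proc)
        self.by_pid[pid] = proc

    def register_task(self, task_id, root_pid):
        task = Task(task_id)
        self._add(root_pid, task)
        return task

    def on_create(self, pid, parent_pid):
        parent = self.by_pid.get(parent_pid)
        if parent is not None:        # events from untracked processes are ignored
            self._add(pid, parent)

    def on_terminate(self, pid):
        proc = self.by_pid.pop(pid, None)
        if proc is None:
            return
        proc.parent.children.remove(proc)
        for child in proc.children:   # orphans go to the nearest living ancestor
            child.parent = proc.parent
            proc.parent.children.append(child)

    def task_of(self, pid):
        node = self.by_pid[pid]
        while not isinstance(node, Task):
            node = node.parent
        return node.task_id
```

Because orphans are re-attached to the parent of the terminating process, following parent links from any tracked process always ends at the Task it was created under.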

Fig. 2. Organization of the kernel extension. 



All accesses to the kernel extension functions are serialized via a lock. Through 
this we implement the atomicity required by the task control operations. Task 
execution is controlled from the kernel extension through standard UNIX sig- 
naling mechanisms as follows. 

Task Suspend: Using a top-down depth-first traversal of the task’s TrackedProcessObj 
tree, we issue a signal(SIGSTOP) to suspend the execution of each 
process. We verify that the process has indeed stopped before proceeding to the 
next process. We have to use a traversal order that guarantees that a parent 
process is stopped before its children are signaled. This avoids the scenario where 
a parent detects that its child has stopped and takes some action. (For example, in 
csh this situation implies that CTRL-Z was issued to the child process.) If a process 
does not stop immediately, we exit from the suspend operation with an EAGAIN 
error code. (In a multiprocessor system, a process being stopped could be active 
on a different processor. In this case, it will only receive the signal when it 
returns from a system call or when it is about to run in a future OS time-slice.) 
Upon detecting the EAGAIN error code, the node-level scheduler retries the 
operation after waiting for a period we refer to as the retry interval. During the 
subsequent invocation, the TaskSuspend function verifies whether a process has 
already been signaled previously with the same signal and whether it has indeed 
stopped. If so, we continue signaling the remaining processes. This mechanism 
deals properly with the dynamics of process creation and termination between 
retries. (Note that, on a retry, we do not have to resend signals, just make sure 
that they have taken effect.) 
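
The suspend-and-retry protocol can be sketched as a small simulation. No real signals are sent: the Proc class, its ticks_to_stop field (a simulated delay before the stop takes effect), and the function names are hypothetical, and one retry interval is modeled as one tick.

```python
import errno


class Proc:
    def __init__(self, pid, children=(), ticks_to_stop=0):
        self.pid = pid
        self.children = list(children)
        self.signaled = False
        self.ticks_to_stop = ticks_to_stop   # simulated delay before the stop takes effect

    def stopped(self):
        return self.signaled and self.ticks_to_stop <= 0


def task_suspend(root):
    """One suspend attempt, top-down depth-first: 0 on success, EAGAIN otherwise."""
    stack = [root]
    while stack:
        proc = stack.pop()
        if not proc.signaled:
            proc.signaled = True             # stands in for sending SIGSTOP once
        if not proc.stopped():
            return errno.EAGAIN              # caller retries after the retry interval
        # Children are only visited after their parent is known to be stopped.
        stack.extend(reversed(proc.children))
    return 0


def suspend_with_retries(root, all_procs, max_retries=100):
    """Node-level scheduler loop; returns the number of retries that were needed."""
    for attempt in range(max_retries + 1):
        if task_suspend(root) == 0:
            return attempt
        for p in all_procs:                  # one retry interval passes
            if p.signaled:
                p.ticks_to_stop -= 1
    raise TimeoutError("task did not stop")
```

On a retry the traversal starts again from the root, but processes that were already signaled are only checked, not signaled again, which mirrors the note at the end of the paragraph above.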

Task Resume: Using a bottom-up depth-first traversal of the task’s TrackedProcessObj 
tree, we use signal(SIGCONT) to resume execution of each process. We 
verify that the process has indeed resumed execution before proceeding to the 
next process. This traversal order is the opposite of what we use when suspending 
a task. Again, the goal is to avoid the scenario where a parent detects its 
child stopped. If a process does not resume immediately, we exit from the resume 
operation with an EAGAIN error code, to let the node-level scheduler retry this 
operation. 

Task Kill: The algorithm to kill a task also uses a top-down traversal of the 
task’s TrackedProcessObj tree, issuing a signal(SIGKILL) to each process. Since 
the action of a SIGKILL is guaranteed, there is no need to test before proceeding 
to the next process. This is the implementation of mechanism M3. 

Task suspend, resume, and kill operations can also be selectively performed 
on a subtree of the processes of a task. 
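
The two traversal orders can be sketched as follows. The tree representation (a dictionary from pid to a list of child pids) and the function names are assumptions made for this example.

```python
def top_down(tree, root):
    """Pre-order: every parent appears before its children (suspend, kill)."""
    order = [root]
    for child in tree.get(root, []):
        order.extend(top_down(tree, child))
    return order


def bottom_up(tree, root):
    """Post-order: every child appears before its parent (resume)."""
    order = []
    for child in tree.get(root, []):
        order.extend(bottom_up(tree, child))
    order.append(root)
    return order


# Process 1 has children 2 and 3; process 2 has children 4 and 5.
tree = {1: [2, 3], 2: [4, 5]}
```

Calling either function with an inner process as the root (for example process 2) yields the order for a subtree operation.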

An odd situation can arise due to the access serialization in the kernel ex- 
tension functions. When a process in the middle of its creation event recognizes 
that its parent has stopped, it is immediately stopped as well. This mechanism, 
together with the top-down traversal, guarantees that, with a finite number of 
retries, progress is made in suspending the task. Ultimately, the number of pro- 
cesses that can be created is limited by the operating system. At one point, all 
processes belonging to the task will be suspended. 

4 Experimental Results 

In this section we discuss an experimental evaluation of our task control mech- 
anism. We conduct this evaluation through direct measurement of the context 
switch time between two tasks. This is a very common operation performed 
by the process tracking facility when implementing time-sharing. A complete 
context switch operation, from a currently running task A to a currently sus- 
pended task B, requires first suspending task A and then resuming task B. The 
mechanism for suspending and resuming tasks is discussed in Section 3. 






For the purpose of our experiments we use tasks that are binary trees of 
processes. We have chosen three types of tasks: 2, 3, and 4 levels deep, for a 
total of 3, 7, and 15 processes per task, respectively. The tasks use very little 
memory and all fit comfortably within the physical memory of our test systems. 
We measure the context switch time between two tasks with the same number 
of processes. The measurements are repeated several times and the results pre- 
sented in this paper represent arithmetic means of the samples. We perform the 
measurements for different values of the retry interval, that is the interval the 
node level scheduler waits before retrying a suspend or resume operation that 
fails. 
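
The shape of the test tasks can be sketched as below. The pid numbering is hypothetical; only the process counts (3, 7, and 15 for 2, 3, and 4 levels) come from the text.

```python
def binary_process_tree(levels, root_pid=1):
    """Return {pid: [child pids]} for a complete binary tree with `levels` levels."""
    total = 2 ** levels - 1                      # number of processes in the task
    tree = {}
    for pid in range(root_pid, root_pid + total):
        index = pid - root_pid + 1               # 1-based position, heap-style
        tree[pid] = [root_pid + c - 1
                     for c in (2 * index, 2 * index + 1) if c <= total]
    return tree
```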

The first set of results that we present is for a uniprocessor machine: a 
160 MHz P2SC thin-node of an IBM RS/6000 SP system, with 256 MB of main 
memory. Results for this machine are shown in Figure 3. The context switch 
operations are very fast, taking less than 1 ms to complete. Also, the context switch 
time is proportional to the number of processes in the tasks, as expected from 
the description in Section 3. Finally, the context switch time is invariant to the 
retry interval. This occurs because retries are extremely infrequent. A suspended 
process cannot be running, therefore a signal(SIGCONT) to such a process has 
immediate effect, bringing it to active state. Correspondingly, if the node-level 
scheduler is running and performing a suspend operation, the processes from the 
tasks being suspended are not running (this is a uniprocessor system). Hence, 
the signal(SIGSTOP) takes effect immediately and the processes are suspended. 
(There are some rare situations when the signal cannot be processed immediately 
and therefore retries are necessary.) 

Fig. 3. Context switch times for a uniprocessor system. 



We repeat these experiments on a multiprocessor machine: an IBM RS/6000 
model S70 system. This machine has four 125 MHz 64-bit RS64 processors and 
512 MB of main memory. Results for this machine are shown in Figure 4. We 
present the same data using nonlinear (Figure 4(a)) and linear (Figure 4(b)) 
time scales. (The nonlinear scale shows more detail in the 1-10 ms range, while 
the linear scale allows for a better understanding of the behavior in the 10-50 ms 
range.) The major difference in comparison to Figure 3 is the effect of the retry 
interval on the total context switch time. Overall, the context switch time is 
much larger than on a uniprocessor, and it increases with the retry interval. 
We also observe, in Figure 4(b), that the context switch time is approximately 
linear in the retry interval in the 10 to 50 ms range. Context switch times on a 
multiprocessor are dominated by the time to suspend a running task. Resuming a 
suspended task still does not require any retries and is very fast. The maximum 
value we observed for resuming a task with 15 processes was approximately 
0.3 ms. 



[Figure 4 plots: context switch time for different retry intervals; (a) nonlinear time scale, (b) linear time scale] 

Fig. 4. Context switch times for a multiprocessor system. 



The interesting behavior occurs when suspending a running task. Because we 
are now dealing with a multiprocessor system, it is possible for an application 
process to be executing at the same time the node-level scheduler is trying to 
suspend it. When that occurs, a signal(SIGSTOP) is sent to the process but it 
is not actually processed until the next time the process is about to run. The 
suspend operation fails and the node-level scheduler has to retry. 

We can model the context switch time as a function of the retry interval 
as follows. Consider a multiprocessor system with $n$ processors and only the 
following processes: the node-level scheduler, the processes from the active task, 
and the processes from the suspended task. We know that the context switch 
time is dominated by the time to suspend processes that are actually running. 
Hence, that will be the focus of the following discussion. Let $\tau$ be the retry 
interval used by the node-level scheduler and let there be $p$ processes in the task 
to be suspended. Let $t_k$ be the time that it takes to suspend one process when 
there are still $k$ processes active. The total time $\theta$ to suspend the active task is: 

$$\theta = \sum_{k=1}^{p} t_k. \qquad (1)$$

Time $t_k$ can be computed as 

$$t_k = \Pr(k,n) \times r \times \tau, \qquad (2)$$

where $\Pr(k,n)$ is the probability that the particular process to be suspended is 
running on one of $n$ processors and $r$ is the average number of retries before it is 
stopped. (If the process is not running, the time to stop it is negligible, as seen 
for the uniprocessor system.) We know that the node-level scheduler is running 
on a processor for sure, leaving $n-1$ processors for the task processes. Also, if 
$k \le n-1$, then the process must be running on some processor. Therefore, we 
can write 

$$\Pr(k,n) = \min\left(1, \frac{n-1}{k}\right). \qquad (3)$$

The mean number of retries to stop a running process is 

$$r = \left\lceil \frac{T_{\mathrm{stop}}(k,n)}{\tau} \right\rceil, \qquad (4)$$

where $T_{\mathrm{stop}}(k,n)$ is the time it takes for a running process to respond to 
signal(SIGSTOP) when there are $k$ processes active on $n$ processors. (The node-level 
scheduler sleeps between retries, so all $n$ processors are available for running task 
processes.) At each system clock tick of length $T$, a system with $n$ processors runs 
$\min(k,n)$ processes. (In AIX, this clock tick is typically 10 ms.) Because of the 
round-robin policy for processes at the same priority level, it will take 

$$\frac{k}{\min(k,n)} - \frac{1}{2} \qquad (5)$$

clock ticks for the signaled process to be scheduled again. (The $-1/2$ term 
accounts for the fact that a signal(SIGSTOP) can be issued any time during the 
current clock tick.) The actual time for stopping a running process is: 

$$T_{\mathrm{stop}}(k,n) = \left(\frac{k}{\min(k,n)} - \frac{1}{2}\right) T = \frac{2k - \min(k,n)}{2\min(k,n)}\,T \qquad (6)$$

and 

$$r = \left\lceil \frac{(2k - \min(k,n))\,T}{2\min(k,n)\,\tau} \right\rceil. \qquad (7)$$

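Substituting Equation (3) and Equation (7) into Equation (2) leads to 

$$t_k = \min\left(1, \frac{n-1}{k}\right) \left\lceil \frac{(2k - \min(k,n))\,T}{2\min(k,n)\,\tau} \right\rceil \tau \qquad (8)$$

and 

$$\theta = \sum_{k=1}^{p} \min\left(1, \frac{n-1}{k}\right) \left\lceil \frac{(2k - \min(k,n))\,T}{2\min(k,n)\,\tau} \right\rceil \tau. \qquad (9)$$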
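
The model can be evaluated numerically with a few lines of Python. This is a sketch under stated assumptions: the number of retries is taken to be rounded up to a whole number, all times are in milliseconds, the clock tick defaults to the 10 ms quoted for AIX, and the function names are hypothetical.

```python
import math


def pr_running(k, n):
    """Probability that the process being stopped is on a processor (Equation 3)."""
    return min(1.0, (n - 1) / k)


def t_stop(k, n, tick):
    """Time for a running process to respond to the stop signal (Equation 6)."""
    return (2 * k - min(k, n)) * tick / (2 * min(k, n))


def suspend_time(p, n, tau, tick=10.0):
    """Total time to suspend a task of p processes on n processors (Equation 9)."""
    total = 0.0
    for k in range(1, p + 1):
        retries = math.ceil(t_stop(k, n, tick) / tau)   # assumed rounding up
        total += pr_running(k, n) * retries * tau
    return total
```

For example, suspend_time(15, 4, tau) can be tabulated for retry intervals between 1 and 50 ms and compared against the trends described in the text: nearly flat for small retry intervals and linear once the retry interval exceeds the largest stop time.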



The correspondence between our model and experimental results is shown in 
Figure 5, again with two different time scales. Values from the model are shown in 
solid lines, whereas the markers represent the experimental data. Overall agree- 
ment is very good, with a tendency by the model to overestimate the context 
switch time. This tendency might be explained by the presence of other pro- 
cesses in the system. In real UNIX systems there are many active processes at 
any given time. For example, even in single-user mode our multiprocessor system 
had 70 active processes, in addition to the task processes of our experiments. 
These are usually system daemons that consume few resources, maybe 1 to 2% 
of total CPU time when combined. Nevertheless, they help decrease the 
probability that application processes will be running at any given time, thus 
reducing the total suspend time for a task. 



[Figure 5 plot: context switch time for different retry intervals] 

Fig. 5. Model and experimental data for context switch times for a multiprocessor 
system. 



Both the model and the experimental results show that there is little varia- 
tion in the context switch time when the retry interval is varied between 1 and 5 
ms. Retrying too soon is ineffective, as the process being signaled will not have 
had time to process the signal. Since each retry operation is a kernel extension 
invocation involving kernel locks, repeating it too often can hurt system per- 
formance. Therefore, a retry interval of 5 or 10 ms is the best choice for the 
node-level scheduler on this system. 






5 System Integration 

Process tracking is currently used to implement task control in a research pro- 
totype gang-scheduling system for the RS/6000 SP [15]. We discuss this system 
here as an example of an application of process tracking. We emphasize that 
process tracking is applicable in many other environments. The general problem 
that process tracking tackles is that of resource reclaiming: We want to be able 
to get back resources that a job is using in order to give them to another job. 
In the case of gang-scheduling, that resource is (mainly) processor time. Process 
tracking can also be used to control “stray” processes that would typically 
outlast the lifetime of a job and continue to consume system resources. 

Gang-scheduling is a mechanism for performing coordinated scheduling of 
the tasks of a parallel job [16]. It partitions the resources of a parallel system 
both in space and time, and all tasks of one job execute concurrently. Gang- 
scheduling has been shown to improve system performance, in particular by 
improving system utilization and reducing job wait time [5,15]. The operation of 
a gang-scheduled system is characterized by a stream of context-switch events, 
that occur at the boundaries of the time-slices. On each context-switch event 
a currently running parallel job is suspended and another job is enabled for 
execution. 

The gang-scheduling system we developed follows the two-tier model discussed 
in Section 1, with a centralized global scheduler and local node-level 
schedulers. The role of the central scheduler is to derive a global schedule for 
the system and then distribute the relevant subsets to the local schedulers in an 
efficient manner. While there are many approaches to representing global sched- 
ules, we have chosen the Ousterhout matrix, due to its simplicity and generality. 
Using hierarchical distribution schemes, we deliver each column of the matrix to 
its designated node-level scheduler. For multiprocessor nodes, the corresponding 
node-level scheduler receives multiple columns. The node-level scheduler is then 
responsible for implementing this local schedule as described by the columns of 
the Ousterhout matrix. It accomplishes that by executing the context switch 
operations at the time dictated by the schedule. The system relies on some form 
of a synchronized clock for all node-level schedulers. This can be either a global 
clock, or distributed local clocks that are kept synchronized with NTP [14]. Note 
that in the case of a multiprocessor node, a single node-level scheduler may have 
to perform multiple context switch operations. 
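
The distribution of the schedule can be sketched as follows. The layout is an assumption made for this example: rows of the matrix are time slices, columns are processors numbered node by node, and entries are job names. The function names are hypothetical.

```python
def columns_for_node(matrix, node, cpus_per_node):
    """Return the columns (one per processor) that belong to one node."""
    first = node * cpus_per_node
    return [[row[c] for row in matrix]
            for c in range(first, first + cpus_per_node)]


def current_slot(now, time_slice, num_rows):
    """Row of the matrix that is active at synchronized time `now` (seconds)."""
    return int(now // time_slice) % num_rows


# Two nodes with two processors each, two time slices.
matrix = [
    ["A", "A", "A", "A"],   # slice 0: job A uses all four processors
    ["B", "B", "C", "C"],   # slice 1: job B on node 0, job C on node 1
]
```

Because every node-level scheduler derives the active row from the same synchronized clock, all nodes switch to the same row at the same time-slice boundary without further communication.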

The environment we use to measure application performance under our gang- 
scheduling system consists of an IBM RS/6000 SP with 9 compute nodes and 
an additional node dedicated to handling job submission and running the global 
scheduler. Each compute node has four 332 MHz PowerPC 604e processors that 
share 1.5 GB of main memory. Job execution is controlled by our gang-scheduling 
prototype, using a time-slice of 10 seconds. As benchmarks we use the three 
pseudo-applications from the NAS Parallel Benchmark suite, version 2.3 [2]: BT, 
LU, and SP. Each benchmark is written in Fortran with calls to the MPI message- 
passing library. BT and SP are compiled to run with 36 tasks (requiring all 9 
nodes), while LU is compiled to run with 32 tasks (requiring 8 nodes). Each 
task consists of exactly three processes. One of the processes implements the 
benchmark itself, while the other two are support processes that implement the 
parallel environment. 

We first run each benchmark on a dedicated environment and measure their 
execution time. This corresponds to gang-scheduling with a multiprogramming 
level (MPL) of one. We then run two, three, and four instances of each benchmark 
at a time, which corresponds to MPLs of 2, 3, and 4, respectively. Results from 
these experiments are shown in Table 1 and in Figure 6. For each benchmark, 
Table 1 shows the memory footprint per task, and the average execution time (in 
seconds) under different multiprogramming levels. Figure 6 presents the same 
performance results graphically, with the execution times for each benchmark 
normalized to the execution time of a single instance of the benchmark in a 
dedicated environment, and then divided by the multiprogramming level. We 
note that the highest memory consumption occurs for BT with an MPL of 4. In 
that case, there are 16 tasks running in each node, for a total memory footprint 
of 590 MB. (The support processes have small footprints.) This is still much 
smaller than the 1.5GB of main memory available in each node, and therefore 
there is minimal paging during execution of the benchmarks. 
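
The footprint figure quoted above follows from simple arithmetic, sketched here with hypothetical function names. The inputs (36 tasks over 9 nodes, 36.9 MB per task, MPL of 4) are taken from the text and Table 1; the footprints of the support processes are ignored.

```python
def tasks_per_node(tasks_per_job, nodes, mpl):
    """Tasks resident on one node when `mpl` instances of a job are time-shared."""
    return tasks_per_job // nodes * mpl


def node_footprint_mb(tasks_per_job, nodes, mpl, footprint_per_task_mb):
    """Combined footprint of the benchmark processes on one node, in MB."""
    return tasks_per_node(tasks_per_job, nodes, mpl) * footprint_per_task_mb
```

For BT this gives 16 tasks per node and about 590 MB, well below the 1.5 GB available in each node.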



Table 1. Results for running the NAS Parallel Benchmarks BT, LU, and SP 
under gang-scheduling. 

              number     footprint          Average execution time (s)
benchmark     of tasks   (MBytes)     MPL = 1    MPL = 2    MPL = 3    MPL = 4
BT, class B      36        36.9        568.79    1157.14    1763.32    2339.36
LU, class B      32         8.5        371.21     743.92    1202.03    1599.72
SP, class B      36        15.2        432.95     916.27    1360.56    1842.16



In the ideal case, running k instances of a benchmark should slow their ex- 
ecution by a factor of k. In this ideal case, all bars in Figure 6 should be at a 
value of 1. Any value above 1 represents overhead resulting from time-sharing. 
We note that the worst overhead occurs for LU with multiprogramming levels of 
3 and 4. In those cases, the measured execution time is 8% larger than expected 
from the ideal case. Overall, our gang-scheduling system does an effective job 
of providing the illusion of a slower virtual machine for the execution of each 
job. Explicit control of the time slices allows us to amortize the cost of context 
switching over a long enough time slice. It also allows for resource dedication for 
longer periods and thus one can expect higher cache hit rates, lower page fault 
rates, and better communication performance (due to the synchronous nature 
of the applications). The footprints of our benchmarks were too small to exercise 
the memory paging system, and we plan to conduct such experiments in the 
near future. 
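
The normalization used in Figure 6 can be reproduced from the execution times in Table 1. The sketch below uses hypothetical names; the numbers are copied from the table.

```python
TIMES = {
    "BT": [568.79, 1157.14, 1763.32, 2339.36],
    "LU": [371.21, 743.92, 1202.03, 1599.72],
    "SP": [432.95, 916.27, 1360.56, 1842.16],
}


def normalized(times):
    """Execution time divided by (dedicated time x MPL); 1.0 is the ideal case."""
    base = times[0]
    return [t / (base * mpl) for mpl, t in enumerate(times, start=1)]


def worst_case(table):
    """Return (benchmark, MPL, normalized time) with the largest overhead."""
    entries = [(name, mpl, value)
               for name, times in table.items()
               for mpl, value in enumerate(normalized(times), start=1)]
    return max(entries, key=lambda e: e[2])
```

The largest normalized value is found for LU at an MPL of 3, about 8% above the ideal case, in agreement with the discussion above.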



[Figure 6 plot: application performance under gang-scheduling] 

Fig. 6. Performance of BT, LU, and SP under gang-scheduling. 



6 Conclusions and Future Work 

In this paper we introduced the new concept of process genealogy to define the 
set of processes comprising a task. Process-to-task binding is an intrinsic 
characteristic of the process, which it inherits from its parent and which cannot be 
modified. None of the existing process set concepts in UNIX satisfies the 
requirements for building task-based genealogy. We have successfully implemented the 
genealogy mechanisms without modifying the operating system. Our approach 
is based on a process tracking kernel extension that monitors process creation 
and termination events and builds a database representing the genealogy con- 
cept. Process tracking also implements atomic suspend and resume operations 
for tasks. These operations form the core of context-switching for time-sharing 
systems. 

We have used process tracking as an integral part of our gang-scheduling sys- 
tem for the RS/6000 SP. An initial prototype of this system has been installed at 
Lawrence Livermore National Laboratory. Deployment in production mode will 
follow soon. Our preliminary results indicate little overhead from the context- 
switching as performed by the process tracking facility. As an added benefit, we 
eliminated the possibility of any stray processes escaping termination control. 
The development of process tracking was motivated by the functionality we
needed. Nevertheless, our experiments have shown that we can use it to implement
efficient, low-overhead task control. (The experiments have also shown the
importance of using a proper retry interval, in the 5 to 10 ms range, when suspending
tasks; 10 ms is the base operating system time-slice.) Process tracking is a vehicle
through which enhanced user-level scheduling can be added to a system.

In terms of future work, we are investigating alternative ways to traverse the 
forest of processes that constitute a task. We currently perform a depth first 
traversal, which we quit (and later retry) when we find a process that is not 
immediately suspended or resumed. We could also use a breadth first traversal: 
Attempt to suspend/resume all processes at a given level before proceeding to the 
next level. This approach would require changes in our data structures and needs 
to be investigated. There is also a hybrid, or optimized, depth first approach: 
We start traversing the tree in depth first mode. If we hit a process that we 
cannot stop then, instead of proceeding to its children, we just move on to its 
siblings. (See Figure 7.) This traversal can be implemented with our current data 
structures. 





Fig. 7. Depth first approach (a) stops as soon as one task (marked X) cannot 
be suspended or resumed. The hybrid approach (b) continues with the siblings 
of task X. 
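
The difference between the two traversals can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the process tracking code: the `Proc` record, its `stoppable` flag (standing in for "the suspend attempt succeeds right now"), and both function names are hypothetical. Each function handles one tree; a forest would be handled by calling it once per root.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Proc:
    pid: int
    stoppable: bool = True                 # hypothetical: suspend would succeed now
    children: List["Proc"] = field(default_factory=list)
    suspended: bool = False


def suspend_depth_first(root: Proc) -> bool:
    """Plain depth-first pass: give up on the whole pass at the first failure."""
    if not root.stoppable:
        return False
    root.suspended = True
    for child in root.children:
        if not suspend_depth_first(child):
            return False                   # remaining siblings are left for a retry
    return True


def suspend_hybrid(root: Proc) -> List[int]:
    """Hybrid pass: skip the subtree under a process that cannot be stopped,
    continue with its siblings, and return the pids that still need a retry."""
    if not root.stoppable:
        return [root.pid]
    root.suspended = True
    pending: List[int] = []
    for child in root.children:
        pending.extend(suspend_hybrid(child))
    return pending
```

In this sketch the hybrid pass suspends every process that is not below an unstoppable one in a single sweep, so a later retry only has to revisit the subtrees rooted at the returned pids.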



Process genealogy is an important concept for job control and accounting. 
We have successfully implemented the genealogy concept through process track- 
ing in a commercial operating system. Other implementation strategies are also 
possible. In particular, as a new line of research, we are currently investigating the
migration of the genealogy concept directly into the kernel of the AIX UNIX
operating system.

Abstract. Recent technological developments, including gigabit networking
technology and low-cost, high-performance microprocessors, have
given rise to metacomputing environments. Metacomputing environments 
combine hosts from multiple administrative domains via transnational 
and world-wide networks. Managing the resources in such a system is 
a complex task, but is necessary to efficiently and economically execute 
user programs. The Legion resource management system is flexible both
in its support for system-level resource management and in its
adaptability to user-level scheduling policies.



1 Introduction 

Legion [6] is an object-oriented metacomputing environment, intended to connect 
many thousands, perhaps millions, of hosts ranging from PCs to massively par- 
allel supercomputers. Such a system will manage millions to billions of objects. 
Managing these resources is a complex task, but is necessary to efficiently and 
economically execute user programs. In this paper, we will describe the Legion 
scheduling model, our implementation of the model, and the use of these mechanisms
to support user-level scheduling. To be successful, Legion will require
much more than simply ganging computers together via gigabit channels — a 
sound software infrastructure must allow users to write and run applications in 
an easy-to-use, transparent fashion. Furthermore, the software must unite ma- 
chines from thousands of administrative domains into a single coherent system. 
This requires extensive support for autonomy, so that we can assure administra- 
tors that they retain control over their local resources. 

In a sense, then, we have two goals which can often be at odds: users want 
to optimize factors such as application throughput, turnaround time, or cost, 
while administrators want to ensure that their systems are safe and secure, 
and will grant resource access according to their own policies. Legion provides a 
methodology allowing each group to express their desires, with the system acting 
as a mediator to find a resource allocation that is acceptable to both parties. 

* This work was funded in part by NSF grant CDA9724552, ONR grant N00014- 
98-1-0454, Northrup-Grumman contract 9729373-00, and DOE contracts DEFG02- 
96ER25290, SANDIA #LD-9391, and D45900016-3G.

Legion achieves this vision through a flexible, modular approach to scheduling
support. This modularity encourages others to write drop-in modules and
to customize system behavior. We fully expect others to reimplement or aug- 
ment portions of the system, reflecting their needs for specific functionality. 
For scheduling, as in other cases, we provide reasonable default policies and al- 
low users and system administrators to customize behavior to meet their needs 
and desires. Our mechanisms have cost that scales with capability — the effort 
required to implement a simple policy is low, and rises slowly, scaling commen- 
surately with the complexity of the policy being implemented. This continuum 
is provided through a substrate rich in functionality that simplifies the imple- 
mentation of scheduling algorithms. 

Before we proceed further, it is important to note a crucial property of our 
work: we neither desire nor profess to be in the business of devising scheduling 
algorithms. We are providing enabling technology so that researchers focusing 
on research in distributed scheduling can build better schedulers with less effort. 
To paraphrase a popular television commercial in the USA, “We don’t make a
lot of the schedulers you use. We make a lot of the schedulers you use better.”¹

Section 2 describes the Legion metacomputing system, and Section 3 out- 
lines the resource management subsystem. We develop a Scheduler using Legion 
resource management in Section 4, and describe other resource management 
systems for metacomputing in Section 5. Finally, we give concluding remarks in 
Section 6. 



2 Legion 



The Legion design encompasses ten basic objectives: site autonomy, support 
for heterogeneity, extensibility, ease-of-use, parallel processing to achieve perfor- 
mance, fault tolerance, scalability, security, multi-language support, and global 
naming. These objectives are described in greater depth in Grimshaw et al. [6]. 
Resource Management is concerned primarily with autonomy, heterogeneity, and 
performance, although other issues certainly play a role. 

The resulting Legion design contains a set of core objects, without which the 
system cannot function, a subset of which are shown in figure 1. These objects 
are critical to resource management in that they provide the basic resources 
to be managed, and the infrastructure to support management. Between core 
objects and user objects lie service objects — objects which improve system per- 
formance, but are not truly essential to system operation. Examples of service 
objects include caches for object implementations, file objects, and the resource 
management infrastructure. 

In the remainder of this section, we will examine the core objects and their 
role in resource management. For a complete discussion of the Legion Core Ob- 
jects, see [10]. We will defer discussion of the service objects until section 3. 

¹ Apologies to BASF.

2.1 Legion Core Objects

Class objects (e.g. HostClass, LegionClass) in Legion serve two functions. As 
in other object-oriented systems, Classes define the types of their instances. In 
Legion, Classes are also active entities, and act as managers for their instances. 
Thus, a Class is the final authority in matters pertaining to its instances, including
object placement. The Class exports the create_instance() method, which is
responsible for placing an instance on a viable host.² create_instance() takes an
optional argument suggesting a placement, which is necessary to implement external
Schedulers. In the absence of this argument, the Class makes a quick (and
almost certainly non-optimal) placement decision. 




Fig. 1. The Legion Core Object Hierarchy 



The two remaining core objects represent the basic resource types in Le- 
gion: Hosts and Vaults. Each has a corresponding guardian object class. Host 
Objects encapsulate machine capabilities (e.g., a processor and its associated 
memory) and are responsible for instantiating objects on the processor. In this 
way, the Host acts as an arbiter for the machine’s capabilities. Our current 
Host Objects represent single-host systems (both uniprocessor and multiproces- 
sor shared memory machines), although this is not a requirement of the model. 
We are currently implementing Host Objects which interact with queue man- 
agement systems such as LoadLeveler and Condor. 

To support scheduling, Hosts grant reservations for future service. The exact
form of the reservation depends upon the Host Object implementation, but
reservations must be non-forgeable tokens; the Host Object must recognize these tokens
when they are passed in with service requests. It is not necessary for any other
object in the system to be able to decode the reservation token (more details
on reservation types are given in section 3.1). Our current implementation of
reservations encodes both the Host and the Vault which will be used for execution
of the object. Vaults are the generic storage abstraction in Legion. To be
executed, a Legion object must have a Vault to hold its persistent state in an
Object Persistent Representation (OPR). The OPR is used for migration and
for shutdown/restart purposes. All Legion objects automatically support shutdown
and restart, and therefore any active object can be migrated by shutting
it down, moving the passive state to a new Vault if necessary, and activating the
object on another host.

² When we write “host” we refer to a generic machine; when we write “Host” we are
referring to a Host Object.

Hosts also contain a mechanism for defining event triggers — this allows a Host 
to, e.g., initiate object migration if its load rises above a threshold. Conceptually, 
triggers are guarded statements which raise events if the guard evaluates to 
a boolean true. These events are handled by the Reflective Graph and Event 
(RGE) mechanisms in all Legion objects. RGE is described in detail in [13,15]; 
for our purposes, it is sufficient to note that this capability exists. 
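
The idea of a trigger as a guarded statement can be illustrated with a small sketch. Everything named here (`Trigger`, `TriggerHost`, `register_outcall`, `reassess`, the `load` attribute) is a hypothetical stand-in chosen for the example; it is not the RGE interface, which is described in [13,15].

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trigger:
    event: str                                  # name of the event to raise
    guard: Callable[[Dict[str, float]], bool]   # evaluated against host attributes


class TriggerHost:
    """Hypothetical stand-in for a Host Object that owns a set of triggers."""

    def __init__(self) -> None:
        self.attributes: Dict[str, float] = {"load": 0.0}
        self.triggers: List[Trigger] = []
        self.outcalls: List[Callable[[str], None]] = []

    def register_outcall(self, handler: Callable[[str], None]) -> None:
        self.outcalls.append(handler)

    def reassess(self, **new_state: float) -> List[str]:
        """Update attributes, then raise one event per trigger whose guard is true."""
        self.attributes.update(new_state)
        raised = [t.event for t in self.triggers if t.guard(self.attributes)]
        for event in raised:
            for handler in self.outcalls:
                handler(event)
        return raised
```

A load-based migration trigger would then be written as `Trigger("migrate", lambda a: a["load"] > 0.8)`, with a monitoring agent registering a handler through `register_outcall`.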



3 Resource Management Infrastructure (RMI) 



Our philosophy of scheduling is that it is a negotiation of service between au- 
tonomous agents, one acting on the part of the application (consumer) and one 
on behalf of the resource or system (provider). This approach has been validated
by both our own past history [4,8] and the more recent work of groups such as
the AppLeS project at UCSD [1]. These negotiating agents can either be the
principals themselves (objects or programs), or Schedulers and intermediaries
acting on their behalf. Scheduling in Legion is never of a dictatorial nature;
requests are made of resource guardians, who have final authority over what
requests are honored.

Fig. 2. Choices in Resource Management Layering

Figure 2 shows several different layering schemes that can naturally arise in
metasystems. In part (a), the application does it all, negotiating directly with re- 
sources and making placement decisions. In part (b), the application still makes 
its own placement decision, but uses the provided Resource Management services 
to negotiate with system resources. Part (c) shows an application taking advan- 
tage of a combined placement and negotiation module, such as was provided in 
MESSIAHS [4]. The most flexible layering scheme, shown in part (d), performs 
each of these functions in a separate module. Without loss of generality, we will 
write in terms of the fourth layering scheme, with the understanding that the 
Scheduler may be combined with other layers, thus producing one of the simpler 
layering schemes. Any of these layerings is possible in Legion; the choice of which 
to use is up to the individual application writer. 

Legion provides simple, generic default Schedulers that offer the classic “90%” 
solution — they do an adequate job, but can easily be outperformed by Schedulers 
with specialized algorithms or knowledge of the application. Application writers 
can take advantage of the resource management infrastructure, described below, 
to write per-application or application-type-specific user-level Schedulers. We are 
working with Weissman’s group at UTSA [16] to develop Schedulers for broad 
classes of applications with similar structures (e.g. 5-point stencils). 

Our resource management model, shown in figure 3, supports our scheduling 
philosophy by allowing user-defined Schedulers to interact with the infrastruc- 
ture. The components of the model are the basic resources (Hosts and Vaults), 
the information database (the Collection), the schedule implementor (the Enac- 
tor), and an execution Monitor. 

The Scheduler is responsible for overall application coordination (recall that 
we are using option (d) from the layering scheme in figure 2). It decides the 
mapping of objects (subtasks) to hosts, based on the current system state. The 
Scheduler can obtain a snapshot of the system state by querying the Collec- 
tion, or it may interact directly with resources (Hosts and Vaults) to obtain the 
freshest state information available. Once the Scheduler computes the schedule,
it passes the schedule to the Enactor, and the Enactor negotiates with the
resource objects named in the schedule to instantiate the objects. Note that this
may require the Enactor to negotiate with several resources from different ad- 
ministrative domains to perform co-allocation. After the objects are running, the 
execution Monitor may request a recomputation of the schedule, perhaps based 
on the progress of the computation and the load on the hosts in the system. 

Figure 3 and the following discussion describe the logical components and 
steps involved in the scheduling process. Again, this description conforms to 
our implementation of the interfaces; others are free to substitute their own 
modules — for example, several components may be combined (e.g. the Scheduler 
or Enactor and the Monitor) for efficiency. The steps in object placement are as 
follows: 

The Collection is populated with information describing the resources (step
1). The Scheduler queries the Collection, and based on the result and knowledge
of the application, computes a mapping of objects to resources. This application-specific
knowledge can either be implicit (in the case of an application-specific
Scheduler), or can be acquired from the application’s classes (steps 2 and 3). This
mapping is passed to the Enactor, which invokes methods on Hosts and Vaults
to obtain reservations from the resources named in the mapping (steps 4, 5,
and 6). After obtaining reservations, the Enactor consults with the Scheduler to
confirm the schedule, and after receiving approval from the Scheduler, attempts
to instantiate the objects through member function calls on the appropriate
class objects (steps 7, 8, and 9). The class objects report success/failure codes,
and the Enactor returns the result to the Scheduler (steps 10 and 11). If, during
execution, a resource decides that the object needs to be migrated, it performs an
outcall to a Monitor, which notifies the Scheduler and Enactor that rescheduling
should be performed (optional steps 12 and 13).

Fig. 3. Use of the Resource Management Infrastructure
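
The reserve, confirm, and instantiate sequence in the steps above can be sketched as follows. This is an illustration under stated assumptions, not Legion code: `HostSketch`, `place`, the slot counter, and the use of random hex strings as opaque tokens are all hypothetical, and instantiation is modeled simply by returning what would be started.

```python
import secrets
from typing import Callable, Dict, List, Optional, Tuple

Mapping = Tuple[str, str, str]  # (class name, host name, vault name)


class HostSketch:
    """Hypothetical Host stand-in that keeps its own reservation table."""

    def __init__(self, vaults: List[str], slots: int) -> None:
        self.vaults = set(vaults)
        self.slots = slots
        self.table: Dict[str, str] = {}       # token -> vault

    def make_reservation(self, vault: str) -> Optional[str]:
        if vault not in self.vaults or self.slots == 0:
            return None                       # vault not usable or no capacity
        self.slots -= 1
        token = secrets.token_hex(8)          # opaque to everyone but this Host
        self.table[token] = vault
        return token

    def cancel_reservation(self, token: str) -> None:
        if self.table.pop(token, None) is not None:
            self.slots += 1


def place(schedule: List[Mapping], hosts: Dict[str, HostSketch],
          approve: Callable[[List[Mapping]], bool] = lambda s: True
          ) -> List[Tuple[str, str, str]]:
    """Reserve every mapping, ask the Scheduler to confirm, then report the
    objects that would be started as (class, host, token); [] means failure."""
    held: List[Tuple[str, str]] = []          # (host name, token)
    for _cls, host, vault in schedule:        # obtain reservations
        token = hosts[host].make_reservation(vault)
        if token is None:
            break
        held.append((host, token))
    if len(held) < len(schedule) or not approve(schedule):
        for host, token in held:              # release whatever was obtained
            hosts[host].cancel_reservation(token)
        return []
    return [(cls, host, token)
            for (cls, host, _vault), (_host, token) in zip(schedule, held)]
```

The `approve` callback plays the role of the Scheduler's confirmation; releasing partial reservations on any failure keeps Hosts from holding capacity for a schedule that will never be enacted.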

The remainder of this section examines each of the components in greater 
detail. 



3.1 Host and Vault Objects 

The resource management interface for the Host object appears in table 1. There 
are three broad groups of functions: reservation management, object manage- 
ment, and information reporting. 

Reservation Management    Process Management    Information Reporting
make_reservation()        StartObject()         get_compatible_vaults()
check_reservation()       killObject()          vault_OK()
cancel_reservation()      deactivateObject()

Table 1. Host Object Resource Management Interface

The reservation functions are used by the Enactor to obtain a reservation
token for each subpart of a schedule. When asked for a reservation, the Host is
responsible for ensuring that the vault is reachable, that sufficient resources are
available, and that its local placement policy permits instantiating the object.
Host Object support for reservations is provided irrespective of underlying system
support for reservations (although the Host is free to take advantage of such
facilities, if they exist). For example, the standard Unix Host Object maintains
a reservation table in the Host Object, because the Unix OS has no notion of 
reservations. Similarly, most batch processing systems do not understand reservations,
and so our basic Batch Queue Host maintains reservations in a fashion
similar to the Unix Host Object.³ A Batch Queue Host for a system that does
support reservations, such as the Maui Scheduler, could take advantage of the 
underlying facilities and pass the job of managing reservations through to the 
queuing system. Our real ability to coordinate large applications running across 
multiple queuing systems will be limited by the functionality of the underlying 
queuing system, and there is an unavoidable potential for conflict. We accept 
this, knowing that our Legion objects are built to accommodate failure at any 
step in the scheduling process. 

Legion reservations have a start time, a duration, and an optional timeout 
period. One can thus reserve an hour of CPU time (duration) starting at noon 
tomorrow (start time). The timeout period indicates how long the recipient has 
to confirm the reservation if the start time indicates an instantaneous reserva- 
tion. Confirmation is implicit when the reservation token is presented with the 
StartObject() call. Our reservations have two type bits: reuse and share. This
allows us to build four types of reservations, as shown in table 2. A reusable
reservation token can be passed in to multiple StartObject() calls. An unshared
reservation allocates the entire resource; shared reservations allow the
resource to be multiplexed. Thus, the traditional “machine is mine for the time 
period” reservation has reuse = 1, share = 0, while a typical timesharing system 
that expires a reservation when the job is done would have reuse = 0, share = 1. 



              reuse = 0                   reuse = 1
share = 0     one-shot space sharing      reusable space sharing
share = 1     one-shot timesharing        reusable timesharing



Table 2. Legion Reservation Types 
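
The two type bits and the effect of the reuse bit can be captured in a short sketch. The class names and the `present()` method are hypothetical and only illustrate the semantics described above; the actual token format is left to each Host Object implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReservationType:
    share: bool   # True: the resource may be multiplexed among reservations
    reuse: bool   # True: the token may accompany several StartObject() calls

    @property
    def label(self) -> str:
        kind = "timesharing" if self.share else "space sharing"
        return f"{'reusable' if self.reuse else 'one-shot'} {kind}"


class ReservationToken:
    """Hypothetical token that enforces the reuse bit when it is presented."""

    def __init__(self, rtype: ReservationType) -> None:
        self.rtype = rtype
        self.uses = 0

    def present(self) -> bool:
        """Return True if the token is accepted for one more StartObject() call."""
        if self.uses > 0 and not self.rtype.reuse:
            return False          # a one-shot token is consumed by its first use
        self.uses += 1
        return True
```

Under this reading, the "machine is mine for the time period" reservation is `ReservationType(share=False, reuse=True)`, and the typical timesharing reservation is `ReservationType(share=True, reuse=False)`.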



³ We have Batch Queue Host implementations for Unix machines, LoadLeveler, and
Codine.

The object (process) management functions allow the creation, destruction,
and deactivation of objects (object reactivation is initiated by an attempt to 
access the object; no explicit Host Object method is necessary). The StartObject 
function can create one or more objects; this is important to support efficient 
object creation for multiprocessor systems. 

In addition to the information reporting methods listed above, the Host also 
supports the attribute database included in all Legion objects. In their simplest 
form, attributes are (name, value) pairs. These information reporting methods 
for Host Objects allow an external agent to retrieve information describing the 
Host’s state automatically (the host’s state is a subset of the state maintained 
by the Host). All Legion objects include an extensible attribute database, the 
contents of which are determined by the type of the object. Host objects pop- 
ulate their attributes with information describing their current state, including 
architecture, operating system, load, available memory, etc. 

The Host Object reassesses its local state periodically, and repopulates its attributes.
If a push model⁴ is being used, it will then deposit information into its
known Collection(s). The flexibility of Legion object attribute databases allows 
the Host Object to export a rich set of information, well beyond the minimal “ar- 
chitecture, OS, and load average” information used by most current scheduling 
algorithms. For example, the Host could export information such as the amount 
charged per CPU cycle consumed, domains from which it refuses to accept ob- 
ject instantiation requests, or a description of its willingness to accept extra jobs 
based on the time of day. This kind of information can help Schedulers to make 
better choices at the outset, thus avoiding the computation of subtly nonfeasible 
schedules. 

The current implementation of Vault Objects does not contain dynamic state 
to the degree that the Host Object implementation does. Vaults, therefore, only 
participate in the scheduling process at the start, when they verify that they 
are compatible with a Host. They may, in the future, be differentiated by the 
amount of storage available, cost per byte, security policy, etc. 



3.2 The Collection 

The Collection acts as a repository for information describing the state of the 
resources comprising the system. Each record is stored as a set of Legion object 
attributes. As seen in figure 4, Collections provide methods to join (with an 
optional installment of initial descriptive information) and update records, thus 
facilitating a push model for data. The security facilities of Legion authenticate 
the caller to be sure that it is allowed to update the data in the Collection. 
As noted earlier, Collections may also pull data from resources. Users, or their
agents, obtain information about resources by issuing queries to a Collection.
A Collection query is a logical expression conforming to the grammar described
in our earlier work [3]. This grammar allows typical operations (field matching,
semantic comparisons, and boolean combinations of terms). Identifiers refer to
attribute names within a particular record, and are of the form $AttributeName.

⁴ We are implementing an intermediate agent, the Data Collection Daemon, which
pulls data from Hosts and pushes it into Collections.



int JoinCollection(LOID joiner);
int JoinCollection(LOID joiner, LinkedList<Uval_ObjAttribute>);
int LeaveCollection(LegionLOID leaver);
int QueryCollection(String Query, &CollectionData result);
int UpdateCollectionEntry(LOID member, LinkedList<Uval_ObjAttribute>);



Fig. 4. Collection Interface 



For example, to find all Hosts running with the IRIX operating system version 
5.x, one could use the regular expression matching feature for strings and query 
as follows:⁵

match($host_os_name, "IRIX") and
match("5\..*", $host_os_name)
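
To make the query semantics concrete, the sketch below evaluates a comparable predicate over in-memory attribute records. It is an illustration only: Python's `re` module stands in for the Unix regexp library, the records are invented, `query_collection` is a hypothetical helper rather than the QueryCollection method, and the `host_os_version` attribute is an assumption introduced for the example. The regular expression is passed as the first argument of `match()`, and the version pattern is anchored with `^` so that only a leading "5." qualifies.

```python
import re
from typing import Callable, Dict, List

Record = Dict[str, str]


def match(pattern: str, value: str) -> bool:
    """Regular expression first, attribute value second."""
    return re.search(pattern, value) is not None


def query_collection(records: List[Record],
                     predicate: Callable[[Record], bool]) -> List[Record]:
    """Return the records whose attributes satisfy the query predicate."""
    return [r for r in records if predicate(r)]


# Invented records; attribute names other than host_os_name are assumptions.
RECORDS: List[Record] = [
    {"host_name": "a", "host_os_name": "IRIX", "host_os_version": "5.3"},
    {"host_name": "b", "host_os_name": "IRIX", "host_os_version": "6.5"},
    {"host_name": "c", "host_os_name": "Linux", "host_os_version": "5.4"},
]


def irix_5(r: Record) -> bool:
    return (match("IRIX", r["host_os_name"])
            and match(r"^5\..*", r["host_os_version"]))
```

Calling `query_collection(RECORDS, irix_5)` keeps only the record for host "a" in this invented data set.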

In its current implementation, the Collection is a passive database of static 
information, queried by Schedulers. We plan to extend Collections to support 
function injection — the ability for users to install code to dynamically compute 
new description information and integrate it with the already existing description 
information for a resource. This capability is especially important to users of the 
Network Weather Service [17], which predicts future resource availability based 
on statistical analysis of past behavior. 

3.3 The Scheduler and Schedules 

The Scheduler computes the mapping of objects to resources. At a minimum, the 
Scheduler knows how many instances of each class must be started. Application- 
specific Schedulers may implicitly have more extensive knowledge about the re- 
source requirements of the individual objects, and any Scheduler may query the 
object classes to determine such information (e.g., the available implementations, 
or memory or communication requirements). The Scheduler obtains resource description
information by querying the Collection, and then computes a mapping
of object instances to resources. This mapping is passed on to the Enactor for 
implementation. It is not our intent to directly develop more than a few widely- 
applicable Schedulers; we leave that task to experts in the field of designing 
scheduling algorithms. Our job is to build mechanisms that assist them in their 
task. 

⁵ The match() function uses the Unix regexp() library, treating the first argument as
a regular expression. Some earlier descriptions of the match() function erroneously
had the regular expression as the second argument.




Fig. 5. The Schedule data structure 

Schedules must be passed between Schedulers and Enactors. A graphical 
representation for a Schedule appears in figure 5. Each Schedule has at least one 
Master Schedule, and each Master Schedule may have a list of Variant Schedules 
associated with it. Both master and variant schedules contain a list of mappings, 
with each mapping having the type (Class LOID → (Host LOID × Vault LOID)).
Each mapping indicates that an instance of the class should be started on the 
indicated (Host, Vault) pair. In the future, this mapping process may also select 
from among the available implementations of an object as well. We will also 
support “k out of n” scheduling, where the Scheduler specifies an equivalence 
class of n resources and asks the Enactor to start k instances of the same object 
on them. 

There are three important data types for interacting with the Enactor: the 
LegionScheduleFeedback, LegionScheduleList, and LegionScheduleRequestList. 
A LegionScheduleList is simply a single schedule (e.g. a Master or Variant sched- 
ule). A LegionScheduleRequestList is the entire data structure shown in figure 
5. LegionScheduleFeedback is returned by the Enactor, and contains the origi- 
nal LegionScheduleRequestList and feedback information indicating whether the 
reservations were successfully made, and if so, which schedule succeeded. 
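
One possible in-memory shape for these data types is sketched below. The class names loosely mirror LegionScheduleList, LegionScheduleRequestList, and LegionScheduleFeedback, but the field names, the use of strings for LOIDs, and the encoding of "which schedule succeeded" as an index pair are assumptions made for the illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

LOID = str  # stand-in for a Legion object identifier


@dataclass(frozen=True)
class Mapping:
    class_loid: LOID   # start an instance of this class ...
    host_loid: LOID    # ... on this Host ...
    vault_loid: LOID   # ... using this Vault


@dataclass
class Schedule:
    """A single schedule (master or variant): a list of mappings."""
    mappings: List[Mapping]


@dataclass
class MasterSchedule:
    schedule: Schedule
    variants: List[Schedule] = field(default_factory=list)


@dataclass
class ScheduleRequest:
    """The whole structure of figure 5: one or more master schedules."""
    masters: List[MasterSchedule]

    def __post_init__(self) -> None:
        if not self.masters:
            raise ValueError("a Schedule needs at least one Master Schedule")


@dataclass
class ScheduleFeedback:
    request: ScheduleRequest
    # (master index, variant index or None); None overall means nothing succeeded
    succeeded: Optional[Tuple[int, Optional[int]]] = None
```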



3.4 The Enactor 

The pertinent portion of the Enactor interface appears in figure 6. A Scheduler 
first passes in the entire set of schedules to the make_reservations() call, and 
waits for feedback. If all schedules failed, the Enactor may (but is not required
to) report whether the failure was due to an inability to obtain resources, a
malformed schedule, or other failure. If any schedule succeeded, the Scheduler
can then use the enact_schedule() call to request that the Enactor instantiate
objects on the reserved resources, or the cancel_reservations() method to release
the resources.



&LegionScheduleFeedback make_reservations(&LegionScheduleList);
int cancel_reservations(&LegionScheduleRequestList);
&LegionScheduleRequestList enact_schedule(&LegionScheduleRequestList);



Fig. 6. Enactor Interface 



We have mentioned master and variant schedules, but have not explained 
how they are used by the Enactor. Each entry in the variant schedule is a single- 
object mapping, and replaces one entry in the master schedule. If all mappings 
in the master schedule succeed, then scheduling is complete. If not, then a vari- 
ant schedule is selected that contains a new entry for the failed mapping. This 
variant may also have different mappings for other instances, which may have 
succeeded in the master schedule. Implementing the variant schedule entails 
making new reservations for items in the variant schedule and canceling any cor- 
responding reservations from the master schedule. Our default Schedulers and 
Enactor work together to structure the variant schedules so as to avoid reservation
thrashing (the canceling and subsequent remaking of the same reservation).
Our data structure includes a bitmap field (one bit per object mapping) for each 
variant schedule which allows the Enactor to efficiently select the next variant 
schedule to try. This keeps the “intelligence” where it belongs: under the control 
of the Scheduler implementer. 
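
One way the per-variant bitmap could drive selection is sketched below. The interpretation is an assumption made for the example: bit i of a variant's bitmap is taken to mean "this variant supplies a replacement for mapping i of the master schedule", and the Enactor picks the first variant, in the order supplied by the Scheduler, whose bitmap covers every failed mapping. The names `Variant`, `select_variant`, and `apply_variant` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Target = Tuple[str, str]  # (host, vault)


class Variant:
    def __init__(self, replacements: Dict[int, Target]) -> None:
        self.replacements = replacements
        self.bitmap = 0
        for index in replacements:          # one bit per replaced mapping
            self.bitmap |= 1 << index


def select_variant(variants: List[Variant], failed_mask: int,
                   start: int = 0) -> Optional[int]:
    """Index of the first variant at or after `start` that replaces every
    failed mapping, or None if no such variant exists."""
    for i in range(start, len(variants)):
        if (variants[i].bitmap & failed_mask) == failed_mask:
            return i
    return None


def apply_variant(master: List[Target], variant: Variant) -> List[Target]:
    """Mappings to enact: master entries, overridden where the variant has one."""
    return [variant.replacements.get(i, target)
            for i, target in enumerate(master)]
```

Because the test is a single mask comparison per variant, the Enactor never has to inspect the mappings themselves to decide what to try next; the ordering of the variants, and therefore the policy, stays with the Scheduler.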

As mentioned earlier, Class objects implement a create_instance() method.
This method has an optional argument containing an LOID and a reservation 
token. Use of the optional argument allows directed placement of objects, which 
is necessary to implement externally computed schedules. The Class object is 
still responsible for checking the placement for validity and conformance to local 
policy, but the Class does not have to go through the standard placement steps. 

3.5 Application Monitoring 

As noted earlier, Legion provides an event-based notification mechanism via its
RGE model [13]. Using this mechanism, the Monitor can register an outcall 
with the Host Objects; this outcall will be performed when a trigger’s guard 
evaluates to true. There is no explicitly-defined interface for this functionality, 
as it is implicit in the use of RGE facilities. In our actual implementation, we have 
no separate monitor objects; the Enactor or Scheduler performs the monitoring,
with the outcall registered appropriately.

4 Examples of Use

We now give an example of a Scheduler that uses our resource management in- 
frastructure. While it does not take advantage of any application-specific knowl- 
edge, it does serve to demonstrate some of the flexibility of the mechanisms. We 
start with a simple random policy, and demonstrate how to build a “smarter” 
Scheduler based on the simple random policy. This improved Scheduler provides 
a template for building Schedulers with more complex placement algorithms. We 
then discuss our plans for building more sophisticated Schedulers with applica- 
tion and domain-specific knowledge. 

The actual source code for the default Legion Scheduler was too voluminous 
to include here. For the sake of brevity and to keep the focus on the facili- 
ties provided by Legion rather than the details of a simple random Scheduler, 
we have presented pseudocode. The source code is contained in release 1.5 of 
the Legion system, released in January 1999. The current release of the Le- 
gion software is available from [9], or by contacting the authors via e-mail at 
legion@cs.virginia.edu. 

4.1 Random Scheduling 

The Random Scheduling Policy, as the name implies, randomly selects from 
the available resources that appear to be able to run the task. There is no 
consideration of load, speed, memory contention, communication patterns, or 
other factors that might affect the completion time of the task. The goal here is 
simplicity, not performance. 

Pseudocode for our random schedule generator is shown in figure 7. The 
Generate_Random_Placement() function is called with a list of classes for which instantiation is 
desired. The Scheduler iterates over this list, and executes the following steps for 
each item. First, the Scheduler extracts the list of available implementations from 
the Class Object it is to instantiate. The Scheduler then queries the Collection 
for matching Hosts, and picks a matching Host at random. After extracting that 
Host’s list of compatible Vaults from the description returned by the Collection, 
the Scheduler randomly selects a vault. This (Host, Vault) pair is added to the 
master schedule. This pair selection is done once for each instance desired for 
this class. 

Note that this algorithm only builds one master schedule, and does not take 
advantage of the variant schedule feature, nor does it calculate multiple sched- 
ules. The Scheduler could call this function multiple times to generate additional 
master schedules. This is not efficient, nor will it necessarily generate a near- 
optimal schedule, but it is simple and easy. This is, in fact, the equivalent of the 
default schedule generator for Legion Classes in releases prior to 1.5. 

After generating the mapping, the Scheduler must interact with the Enactor 
to determine if the placement was successful. Although not shown in figure 7, 
the simple implementation passes a single master schedule to the Enactor via 
the make_reservations() and enact_schedule() methods, and reports the success or 
failure of that call back to the object that invoked the Scheduler. No attempt 
is currently made to generate other placements, although a more sophisticated 
Scheduler would certainly do so. 

Generate_Random_Placement(ObjectClass list) { 
    for each ObjectClass O in the list, do { 
        query the class for available implementations 
        query Collection for Hosts matching available implementations 
        k = the number of instances of this class desired 
        for i := 1 to k, do { 
            pick a Host H at random 
            extract list of compatible vaults from H 
            randomly pick a compatible vault V 
            append the target (H, V) to the master schedule 
        } 
    } 
    return the master schedule 
} 

Fig. 7. Pseudocode for random placement 
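
As a rough illustration, the random placement pseudocode can be turned into runnable Python. This is a sketch under stated assumptions, not the Legion source: the Collection is modeled as a plain list, and the names `ObjectClass`, `HostRecord`, and `query_collection` are hypothetical stand-ins for the corresponding Legion objects and queries.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ObjectClass:
    name: str
    implementations: List[str]  # architectures the class has binaries for (assumed)
    instances: int              # k, the number of instances desired


@dataclass
class HostRecord:
    name: str
    architecture: str
    vaults: List[str]           # Vaults compatible with this Host


def query_collection(collection: List[HostRecord],
                     implementations: List[str]) -> List[HostRecord]:
    """Hypothetical Collection query: Hosts matching an available implementation."""
    return [h for h in collection
            if h.architecture in implementations and h.vaults]


def generate_random_placement(classes: List[ObjectClass],
                              collection: List[HostRecord],
                              rng: random.Random) -> List[Tuple[str, str, str]]:
    """Build one master schedule of (class, Host, Vault) targets at random."""
    master = []
    for cls in classes:
        hosts = query_collection(collection, cls.implementations)
        if not hosts:
            raise LookupError(f"no matching Host for class {cls.name}")
        for _ in range(cls.instances):
            host = rng.choice(hosts)        # pick a Host at random
            vault = rng.choice(host.vaults)  # pick one of its compatible Vaults
            master.append((cls.name, host.name, vault))
    return master


collection = [
    HostRecord("h1", "linux", ["v1", "v2"]),
    HostRecord("h2", "solaris", ["v3"]),
    HostRecord("h3", "linux", []),  # no compatible Vault, never selected
]
schedule = generate_random_placement(
    [ObjectClass("Worker", ["linux"], 3)], collection, random.Random(0))
```

As in the pseudocode, no load or performance information enters the choice; the only filtering is the match between implementations and Hosts.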



4.2 Improved Random Scheduling (IRS) 

There are many possible improvements on our random placement algorithm, 
both for efficiency of calculation and for efficacy of the generated schedule. The 
improvement we focus on is not in the basic algorithm; the IRS still selects a 
random Host and Vault pair. Rather, we will compute multiple schedules and 
accommodate negative feedback from the Enactor. The pseudocode for IRS is 
in figures 8 and 9. 

The improved version generates n random mappings for each object class, and 
then constructs n schedules out of them. The Scheduler could just as easily build 
n schedules through calls to the original generator function, but IRS does fewer 
lookups in the Collection. Note also that, because this is random placement, 
we do not consider dependencies between objects in the placement. A more 
sophisticated Scheduler would take this into account either when generating 
the individual instance mappings or when combining instance mappings into a 
schedule. 

The Wrapper function has three global variables that limit the number of 
times it will try to generate schedules, the number of times it will attempt to 
enact each schedule, and the number of variant schedules generated per call to 
the generation function.* Again, this is a simple-minded approach to solving the 
problem, but serves to demonstrate how one could construct a richer Scheduler. 

* We realize that the value returned from the generator and passed to the Enactor 
should be a list of master schedules; we take liberty with the types in the pseudocode 
for the sake of brevity. 
IRS_Gen_Placement(ObjectClass list, int n) { 
    for each ObjectClass O in the list, do { 
        query the class for available implementations 
        query Collection for Hosts matching available implementations 
        k = the number of instances of this object desired 
        for l := 1 to n, do { 
            for i := 1 to k, do { 
                pick a Host H at random 
                extract list of compatible vaults from H 
                randomly pick a compatible vault V 
                append the target (H, V) to the list for this instance 
            } 
        } 
    } 
    master sched. = first item from each object inst. list 
    for l := 2 to n, do { 
        select the l-th component of the list for each object instance 
        construct a list of all that do not appear in the master list 
        append to list of variant schedules 
    } 
    return the master schedule 
} 

Fig. 8. Pseudocode for the IRS Placement Generator 



IRS_Wrapper(ObjectClass list) { 
    for i in 1 to SchedTryLimit, do { 
        sched = IRS_Gen_Placement(ObjectClass list, NSched); 
        for j in 1 to EnactTryLimit, do { 
            if (make_reservations(sched) succeeded) { 
                if (enact_placement(sched) succeeded) { 
                    return success; 
                } 
            } 
        } 
    } 
    return failure; 
} 

Fig. 9. Pseudocode for the IRS Wrapper 
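
The generator and wrapper can be sketched together in Python. The sketch simplifies the pseudocode in two ways that are assumptions of this example: object instances are given as a flat list of identifiers, and the result of the single Collection lookup is modeled as a dictionary from Host name to compatible Vaults. The `enactor` callable is a hypothetical stand-in for the reservation and enactment calls on the Enactor.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Target = Tuple[str, str]     # (Host, Vault)
Schedule = Dict[str, Target]  # instance id -> target


def irs_gen_placement(instances: List[str], hosts: Dict[str, List[str]],
                      n: int, rng: random.Random
                      ) -> Tuple[Schedule, List[Schedule]]:
    """Draw n random targets per instance from one lookup result, then
    assemble a master schedule and up to n-1 variant schedules."""
    host_names = sorted(hosts)
    picks: Dict[str, List[Target]] = {}
    for inst in instances:
        picks[inst] = []
        for _ in range(n):
            h = rng.choice(host_names)
            picks[inst].append((h, rng.choice(hosts[h])))
    # Master schedule: the first target drawn for each instance.
    master = {inst: p[0] for inst, p in picks.items()}
    # Variant l keeps only the targets that differ from the master.
    variants = [{inst: p[l] for inst, p in picks.items() if p[l] != master[inst]}
                for l in range(1, n)]
    return master, variants


def irs_wrapper(instances: List[str], hosts: Dict[str, List[str]],
                enactor: Callable[[Schedule, List[Schedule]], bool],
                sched_try_limit: int = 3, enact_try_limit: int = 2,
                n_sched: int = 4, rng: Optional[random.Random] = None) -> bool:
    """Retry generation and enactment within the two limits."""
    rng = rng or random.Random()
    for _ in range(sched_try_limit):
        master, variants = irs_gen_placement(instances, hosts, n_sched, rng)
        for _ in range(enact_try_limit):
            if enactor(master, variants):  # reserve, then enact
                return True
    return False


hosts = {"h1": ["v1", "v2"], "h2": ["v3"]}
attempts = {"n": 0}


def flaky_enactor(master: Schedule, variants: List[Schedule]) -> bool:
    """Pretend the first three enactment attempts fail."""
    attempts["n"] += 1
    return attempts["n"] >= 4


ok = irs_wrapper(["a", "b"], hosts, flaky_enactor, rng=random.Random(0))
```

With the default limits, the wrapper gives up after at most six enactment attempts; in the example it succeeds on the fourth, that is, on the second generated schedule.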
4.3 Specialized Policies 

We are in the process of defining and implementing specialized placement policies 
for structured multi-object applications. Examples of these applications include 
MPI-based or PVM-based simulations, parameter space studies, and other mod- 
eling applications. Applications in these domains quite often exhibit predictable 
communication patterns, both in terms of the compute/communication cycle 
and in the source and destination of the communication. For example, we are 
working with the DoD MSRC in Stennis, Mississippi to develop a Scheduler 
for an MPI-based ocean simulation which uses nearest-neighbor communication 
within a 2-D grid. 

5 Related Work 

The Globus project [5] is also building metacomputing infrastructure. At a high 
level, their scheduling model closely resembles that of Legion, as we first presented 
it at the 1997 Legion Winter Workshop [2]. There is a rough correspondence 
between Globus Resource Brokers and Legion Schedulers; Globus Information 
Services and Legion Collections; Globus Co-allocators and Legion Enactors; 
and Globus GRAMs and Legion Host Objects. However, there are substantial 
differences in realization of the model, due primarily to two features of Legion 
not found in Globus: the object-oriented programming model and strong support 
for local autonomy among member sites. Legion achieves its goals with 
a “whole-cloth” design, while Globus presents a “sum-of-services” architecture 
layered over pre-existing components. Globus has the advantage of a faster path 
to maturity, while Legion encompasses functionality not present in Globus. An 
example of this is in the area of reservations and schedules. Globus has no intrinsic 
reservation support, nor does it offer support for schedule variation: each 
task in Globus is mapped to exactly one location. 

There are many software systems for managing a locally-distributed multicomputer, 
including Condor [11] and LoadLeveler [14]. These systems are typically 
Queue Management Systems intended for use with homogeneous resource 
pools. While extremely well-suited to what they do, they do not map well onto 
wide-area environments, where heterogeneity, multiple administrative domains, 
and communications irregularities dramatically complicate the job of resource 
management. Indeed, these types of systems are complementary to a metasys- 
tem, and we will incorporate them into Legion by developing specialized Host 
Objects to act as mediators between the queuing systems and Legion at large. 

SmartNet [7] provides scheduling frameworks for heterogeneous resources. 
It is intended for use in dedicated environments, such as the suite of resources 
available at a supercomputer center. Unlike Legion, SmartNet is not intended 
for large-scale systems spanning administrative domains. Thus, SmartNet could 
be used within a Legion system by developing a specialized Host Object, similar 
to the Condor and LoadLeveler Host Objects mentioned earlier. IBM’s DRMS 
[12] also provides scheduling frameworks, in this case targeted towards recon- 
figurable applications. The DRMS components serve functions similar to those 
of the Legion RMI, but like SmartNet, DRMS is not designed for wide-area 
metacomputing systems. 

6 Conclusions and Future Work 

This paper has described the resource management facilities in the Legion meta- 
computing environment, including reservations and schedule handling mecha- 
nisms. We have focused on the components of the resource management subsys- 
tem, presented their functionality, and described the interfaces of each compo- 
nent. Using these interfaces, we have implemented sample Schedulers, including 
a simple random Scheduler and a more sophisticated, but still random, Scheduler. 
These sample Schedulers point the way to building more complex and 
sophisticated Schedulers for real-world applications. 

We are in the process of benchmarking the current system so that we can 
measure the improvement in performance as we develop more intelligent Sched- 
ulers. We are developing Network Objects to manage communications resources. 
The object interfaces will evolve in response to need — as we work with our re- 
search partners who are developing scheduling algorithms, we will enrich both 
the content and capability of the Resource Management Infrastructure and the 
Legion core objects. 
Abstract. The main advantage of a metacomputer is not its peak performance 
but better utilization of its machines. Therefore, efficient scheduling 
strategies are vitally important to any metacomputing project. A real 
metacomputer management system will not gain exclusive access to all 
its resources, because participating centers will not be willing to give up 
autonomy. As a consequence, the scheduling algorithm has to deal with a 
set of local sub-schedulers performing individual machine management. 
Based on the proposal made by Feitelson and Rudolph in 1998 we de- 
veloped a scheduling model that takes these circumstances into account. 

It has been implemented as a generic simulation environment, which we 
make available to the public. Using this tool, we examined the behavior 
of several well known scheduling algorithms in a metacomputing sce- 
nario. The results demonstrate that interaction with the sub-schedulers, 
communication of parallel applications, and the huge size of the meta- 
computer are among the most important aspects for scheduling a meta- 
computer. Based upon these observations we developed a new technique 
that makes it possible to use scheduling algorithms developed for less 
realistic machine models for real world metacomputing projects. Simula- 
tion runs demonstrate that this technique leads to far better results than 
the algorithms currently used in metacomputer management systems. 



1 Introduction 

Ever since 1992 when the term metacomputing was coined by Larry Smarr and 
Charles E. Catlett [42], researchers all over the world have been working with 
increasing effort on this promising concept. The most obvious advantage of a 
metacomputer is its tremendous computing power. Theoretically, a nation wide 
cluster can provide more FLOPS than any single machine has ever been able 
to achieve. However, after six years of research the first wave of enthusiasm has 
passed and researchers realized that many problems have to be solved before 
the idea of Smarr and Catlett can become reality. In particular, the Information 
Wide Area Year [12] demonstrated not only the potential of the metacomputing 
concept but also its limitations. Particularly on the application side, 
there exist only a few programs that can fully exploit the accumulated power of 
a large heterogeneous machine pool. For most real world applications efficiency 
- and sometimes even performance - decreases dramatically, if they are exe- 
cuted on a cluster of WAN distributed machines. Consequently, the enormous 
efforts necessary for building a metacomputer cannot be justified by its theoret- 
ical peak performance. During the recent past, other benefits of metacomputing 
have become more and more apparent. One of the most important aspects of 
a nationwide computational grid is its ability to share load more evenly and 
achieve better utilization of the available hardware. 

Nowadays people show a strong tendency to run their jobs on those machines 
that are locally available. Only if these resources are absolutely insufficient are 
they willing to go through the trouble of applying for an account at a remote 
center with more powerful machines. If access to a remote machine were as easy 
as to HPC resources at the local site, users would be more flexible in choosing 
their target hardware. As a consequence, the total amount of job requests would 
spread more evenly over the available machines. Even more important: computer 
systems with special features could more easily be kept free for those applica- 
tions that can take advantage of these. Software with less demanding hardware 
requirements could be run on general purpose machines without losing efficiency, 
while code that was tuned for a particular system could be executed 
faster on the appropriate hardware. As a consequence, the implementation of 
a metacomputer could effectively increase the amount of available computing 
power without the need to add extra hardware. 

If the primary goal of building a metacomputer is to optimize resource us- 
age, then its scheduling component becomes a critical part. From the scheduling 
perspective, a metacomputing management system can be divided into two dis- 
tinct layers. The bottom layer consists of the primary infrastructure that links 
the distributed machines together and manages their information exchange. On 
top of this there is the scheduling layer that decides when, how, and where jobs 
are executed. Obviously, there has to be close cooperation between the scheduler 
and the underlying infrastructure. In the past, researchers often worked 
independently on these two layers. As a consequence, there is now a large gap 
between research results achieved in both areas [19,32]. The best known scheduling 
algorithms were studied using analytical models that are not directly compatible 
with the infrastructure implemented in the various metacomputing projects. 
It is therefore the purpose of this work to show a possible way of bridging this 
gap and to quantify the inevitable performance decreases. 

The rest of this paper is organized as follows. In Sec. 2 we give an overview of 
current research in the field of metacomputer environments and relevant schedul- 
ing strategies. We describe our machine model in Sec. 3 and the workload model 
as well as the evaluation criteria in Sec. 4. Sec. 5 outlines the examined scheduling 
algorithms and describes the most important simulation results. Furthermore, 
a new technique is shown that makes it possible to use scheduling algorithms 
developed for less realistic machine models for real world metacomputing 
projects. Finally, in Sec. 7 we summarize the results, draw conclusions for the 
scheduling community, and provide an outlook on our future work. 

2 Background 

Many fields of research are contributing to the implementation of a metacom- 
puter. For example, networking, security, interface design, load balancing, map- 
ping, distributed system design, or even multimedia are all important aspects of 
this field of research. In this paper we concentrate on what has been achieved in 
infrastructure oriented projects and on scheduling algorithms that are applicable 
to such an environment. 

2.1 Infrastructure 

Although there exists a definition of “metacomputing” in [42] that is frequently 
referenced, there is no broad agreement about what a metacomputer should look 
like. A lot of projects have been set up since 1992 which rightfully claim to be 
working in the field of metacomputing. However, some of their approaches and 
goals differ significantly. We found that the majority of these projects can be 
divided into five distinct categories: 

HPC Metacomputing is the most traditional approach. Projects of this category 
aim to connect high performance computers via wide area networks. 
Supercomputers are typically installed at different locations all over a nation. As 
a consequence, these projects do not only have to deal with technical problems 
but also with the political aspects arising from the necessary sub-ordination 
of supercomputer centers under any form of metacomputer management. The 
machines forming the resource pool are heterogeneous in hardware and software 
(e.g. operating system). In the past, most representatives of this category 
of metacomputing systems had a centralized architecture which incorporated 
a global scheduling module with full control over the available resources. Due 
to political and technical considerations (e.g. fault tolerance) nowadays most 
projects favor a more distributed approach. [20,9,36,44,31] are among the most 
prominent initiatives of this category. 

Cluster Computing has developed very fast during the last six years. The goal of 
these projects is to connect a large number of workstation sized computers as a 
cheap alternative to supercomputers. Although some are also trying to connect 
resources of different administrative entities (e.g. universities), a more typical 
use is a building-wide metacomputer. Most of these systems support 
heterogeneous hardware but put restrictions on software heterogeneity. Since 
there is much economic potential in this area, there are several commercial 
products available offering a high degree of robustness. However, compared to 
HPC-metacomputing these systems focus more on distributing sequential jobs 
and less on efficiently scheduling parallel applications. Some of the best known 
cluster computing environments are [25,33,23,5]. 
Single Site Metacomputing is similar to HPC metacomputing. The main difference 
is the restriction to only one administrative entity. Many supercomputing 
centers have been enhancing their local management systems to provide meta- 
computing services on top of the local machine pool. From the algorithmic point 
of view, this problem is almost identical to HPC metacomputing. However, results 
can be achieved faster, because there are fewer political problems and a 
centralized software architecture is feasible. [21,1,27] are some examples of these 
projects. 

Object Oriented Metacomputing is increasingly gaining importance in the meta- 
computing community. Applications that have been implemented using an object 
oriented programming paradigm can easily be ported to the distributed runtime 
environment of a metacomputer. Especially the availability of the CORBA stan- 
dard [41] has had strong effects on the development of distributed object oriented 
environments. CORBA provides full support of heterogeneity in hard- and soft- 
ware as well as a possible technical solution to the interconnection of several 
sites. Since each site within such a system represents itself by one or more ob- 
ject request brokers, CORBA also indicates a possible way to overcome some of 
the political obstacles. This is not the least reason why similar approaches have 
been adopted by several HPC metacomputing projects. The major disadvantage 
of object oriented metacomputing is related to the fact that currently only few 
relevant applications have been programmed in object oriented languages. 

Seamless Computing is the most pragmatic approach and probably not truly 
metacomputing. Projects of this class are working on standardizations of super- 
computer access interfaces. Nowadays, almost any supercomputing center has 
its own access modalities and users have to learn and remember how jobs are 
submitted to each of the available machines. Therefore, there is a strong de- 
mand for a unified access interface. Some of the projects working in this area 
are described in [3,28,6]. 

Our institute, the Paderborn Center for Parallel Computing, is involved in 
a couple of projects that deal with interconnecting supercomputers [30,22,35]. 
Thus, our work concentrates mainly on aspects of HPC metacomputing. The 
model described in Sec. 3 is therefore derived from this category. 



2.2 Job Scheduling for Metacomputing 

Much research has been done in the field of parallel job scheduling. A good 
overview of the current state can be found in [19]. Unfortunately, many researchers 
have been using their own nomenclature for specifying the capabilities 
of their algorithms. In 1997 Feitelson et al. proposed a common set of terms for 
describing scheduling strategies [19]. We will use these definitions throughout 
the rest of this paper. 

According to [19], scheduling models can be classified according to partition 
specification, job flexibility, level of preemption, amount of job and workload 
knowledge available, and memory allocation. Clearly, a metacomputer is a distributed 
memory system (although some of its nodes may be shared memory 
machines). The other criteria are less easy to define. Given the current practice 
and the results of Smith, Foster, and Taylor in [43], it seems sensible to allow 
the expected application runtime as the only job related information available 
to the scheduler. Newer management software packages such as CCS [21] allow 
much more complex job descriptions, but these systems cannot be expected to be 
broadly available in the near future. On the machine level, handling of partition 
sizes ranges from fixed partitions on many batch systems to dynamic partitions 
on modern SMP architectures. On the metacomputer level it seems best to 
define partition sizes according to job requirements and current load situation 
before an application is executed. The effort of implementing mechanisms 
for re-adjusting partition sizes during run time seems inappropriate considering 
the small number of applications that currently support such a feature. Therefore, 
we are using variable partitions in our metacomputer model. The same 
arguments also hold for job flexibility. Several parallel applications can run on 
different partition sizes but only few support dynamic re-adjustment during run 
time. Therefore, we are looking for a scheduler that can efficiently manage a 
mix of moldable and rigid jobs (strictly speaking, rigid jobs are moldable jobs 
with a fixed degree of parallelism). The most difficult decision was whether 
the metacomputer should support preemption. Up to now, only few job management 
systems provide support for preemptive scheduling. However, many long 
running scientific applications (i.e. applications well suited for metacomputing) 
have internal checkpointing and restart facilities. Therefore, it would not be too 
difficult to implement coarse grain preemptive scheduling for these applications, 
especially since preemption seems to be a promising concept for scheduling of 
parallel jobs [38]. Nevertheless, current versions of the major metacomputing 
environments, including our own MOL system [35], do not support preemption 
and therefore we have so far only considered run-to-completion scheduling 
algorithms. 

Typically, scheduling disciplines are distinguished as online or offline algorithms. 
However, as we will show below, this difference becomes less important 
in the metacomputer scenario. Therefore, both types of algorithms are potential 
candidates for a metacomputer scheduler. Since the problem of scheduling 
parallel tasks on multiple parallel machines has been proven to be computa- 
tionally hard for most cases, the most commonly used algorithms are based 
on heuristics. Although several papers have been published on using general- 
purpose heuristics for scheduling [37,8], better results are usually obtained by 
more problem specific strategies. The latter can be divided into two categories. 
On the one hand, there are algorithms derived from analytical considerations 
like for example those discussed in [29]. Often, these strategies are difficult to 
implement but guarantee a good worst case behavior. On the other hand, 
there are several fairly simple heuristics that have been tested in simulation en- 
vironments or sometimes even as part of real world scheduling systems [32,4,10]. 
These approaches do not have such a good worst case behavior, but the results 
for “typical” problems are often quite good [29]. We decided to concentrate our 
examinations on those strategies that are not too complex for implementation 
and for which good performance has been demonstrated with real workloads. 
Although there are still a lot of algorithms in this class, many of them are based 
on three fundamental approaches: FCFS [39], FFIH [34], and ASP [40]. In Sec. 5 
we provide a closer look at these strategies and show whether and how they can be 
applied to a metacomputing scenario. 



3 Modeling the Metacomputer Infrastructure 

Our main motivation was to find a scheduling algorithm that can be imple- 
mented on top of the NRW metacomputing infrastructure [30]. One of the key 
concepts of this project is to maintain autonomy for the participating supercomputing 
centers. In particular, it was not possible to replace existing resource 
management systems like NQS [2], NQE [11], LoadLeveler [26], or CCS [21] by 
a global scheduling module. The metacomputer management facilities had to 
be installed on top of the existing environments. It should be emphasized that 
this is not only a problem of the NRW metacomputer. Other metacomputing 
projects like GLOBUS [20], POLDER [44,31], or the Northrhine-Westphalian 
Metacomputer [36] had to use similar approaches. As a consequence the meta- 
computer scheduling algorithm cannot access the available machines directly 
but has to submit its jobs to the local queuing systems. Furthermore, users with 
direct access to the machines cannot be forced to submit all their jobs via the 
metacomputer, especially since metacomputer frontends are designed to provide 
easy access to a heterogeneous machine pool and therefore often do not support 
the more advanced features of the hardware. Thus, local users will continue 
submitting jobs to the machines directly, thereby bypassing the metacomputer 
scheduler. Fig. 1 depicts this model. Each computer is controlled by a local re- 
source management system (RMS). The metacomputer scheduler has no direct 
access to the hardware. It can only submit (sub-)jobs via each center’s local queu- 
ing environment. In general, there are two different types of resource requests 
that can be sent to such a system. On the one hand, these are metacomputer 
jobs that are submitted to the metacomputing scheduler. On the other hand, 
there are local requests that are submitted directly to the resource management 
software without using the metacomputer facilities. The amount of information 
that is communicated from the local queuing systems to the metacomputer is a 
parameter of the model. Basically, there are three different levels of information 
exchange possible: 

1. The metacomputer has no knowledge about the locally submitted jobs. It 
has no information on whether (or how many) resources have already been assigned 
by the RMS and how much is left for the metacomputer (scheduling with no 
control). 

2. The meta-scheduler can query the local queuing systems about the resources 
assigned to local jobs. This may include jobs that have already been submitted 
but are not yet scheduled. This type of information can be unreliable, 
because the jobs’ resource requirements are typically not exactly known in 
advance (scheduling with limited control). 

3. Whenever a job is submitted to a local management system, it is forwarded 
to the metacomputer for scheduling (scheduling with full control). This is 
the model that has been used in most theoretical papers on metacomputer 
scheduling. 

Fig. 1. The metacomputer accesses the machines through local resource 
management systems 

It is obvious that a scheduling algorithm designed for model 1 cannot perform 
better than the same strategy working in scenario 2. The same holds for models 
2 and 3. Consequently, scenario 1 can be expected to deliver the worst results. 
However, this approach is currently used in most real world metacomputer man- 
agement systems [20,36,31]. The most promising model is the approach that gives 
full control to the metacomputer scheduler. Unfortunately, due to the political 
reasons explained above, this approach is unlikely to become reality. Therefore 
we examined the performance losses that scheduling algorithms 
developed for the full control model suffer when they are used in a no control 
environment. Since worst case analysis can be misleading when determining the 
performance of algorithms applied to real world problems [19,24], we base 
our comparisons on simulation runs. In order to obtain results that are useful for 
other researchers, too, we derived our simulated environment from the suggested 
standard workload model that was proposed by Feitelson and Rudolph in [18]. 
A detailed description of the workload model is given in Sec. 4. 
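
To make the difference between the three levels of information exchange concrete, the following Python sketch computes how many processors a meta-scheduler would believe to be free on one machine. All names and numbers are hypothetical, and treating "no control" as "assume the whole machine is free" is an assumption of this sketch; the model only states that no information is available in that case.

```python
from enum import Enum


class Control(Enum):
    NONE = 1     # no knowledge about locally submitted jobs
    LIMITED = 2  # local usage can be queried, but the answer may be inexact
    FULL = 3     # every local job is forwarded to the metacomputer


def believed_free(total: int, local_actual: int, local_reported: int,
                  level: Control) -> int:
    """Processors the meta-scheduler assumes to be free on one machine."""
    if level is Control.NONE:
        return total  # optimistic guess: local load is invisible
    if level is Control.LIMITED:
        return max(total - local_reported, 0)  # based on a possibly wrong report
    return total - local_actual  # exact, since all jobs pass through it


# A 64-processor machine whose local jobs really use 40 processors,
# while the local queuing system reports an estimate of 32.
truly_free = 64 - 40
errors = {level: believed_free(64, 40, 32, level) - truly_free
          for level in Control}
```

In this example the overestimate shrinks from 40 processors with no control to 8 with limited control and 0 with full control, which mirrors the ordering of the three scenarios argued above.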
Besides the scheduling hierarchy, the configuration of the heterogeneous machine 
pool is another important aspect of the metacomputer model. For our 
work, we employ the concept of application centric metacomputing described in 
[22]. Therefore, we may assume that in principle any job can be fulfilled by each 
machine. The only heterogeneity that is visible to the metacomputer scheduler is 
caused by performance differences between the computing nodes. For simplicity, 
we will use performance and number of processors of a node as synonymous for 
the rest of this paper (see [22]). What still remains to be defined is the distribution 
of machine sizes (i.e. performances) within the metacomputer. Since we 
are studying metacomputing with supercomputers, we used the Top 500 
Report of Jack Dongarra et al. as a reference [13]. Fig. 2 depicts the performance 
distribution of the world’s 500 fastest supercomputers. As can be seen, performance 
follows roughly a uniform logarithmic distribution and the probability of 
a machine having fewer than P processors can be estimated by 

$$ p(\mathrm{Procs} < P) = 2 \ln(P) + c \qquad (1) $$

Fig. 2. Performance distribution in Top500 list 
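
A distribution whose cumulative probability grows linearly in ln(P), as in Eq. (1), can be sampled by inverse transform. The Python sketch below is an illustration only: instead of the constants of Eq. (1) it normalizes the distribution between two bounds `p_min` and `p_max`, which are hypothetical parameters introduced here, so that the probability is 0 at `p_min` and 1 at `p_max`.

```python
import math
import random


def size_cdf(p: float, p_min: int, p_max: int) -> float:
    """Probability of fewer than p processors: linear in ln(p) between the bounds."""
    return math.log(p / p_min) / math.log(p_max / p_min)


def sample_machine_size(p_min: int, p_max: int, rng: random.Random) -> int:
    """Inverse-transform sampling: solve size_cdf(P) = u for a uniform u in [0, 1)."""
    u = rng.random()
    size = round(p_min * (p_max / p_min) ** u)
    return max(p_min, min(p_max, size))


rng = random.Random(1)
sizes = [sample_machine_size(16, 1024, rng) for _ in range(10000)]
# Under a log-uniform law, half of the machines lie below the geometric
# mean of the bounds, here sqrt(16 * 1024) = 128 processors.
share_below_128 = sum(s < 128 for s in sizes) / len(sizes)
```

The consequence for the simulated machine pool is that small machines are far more numerous than large ones: each doubling of the size covers the same share of machines.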
4 Workload Model and Evaluation Criteria 

The model for evaluating our scheduling strategies is derived from the suggestion 
made by Feitelson and Rudolph in [18]. Their work was a first step towards a 
unified benchmark that will make it easier to compare the quality of different 
scheduling algorithms. Currently, there exists a broad variety of working sce- 
narios and evaluation metrics used by different authors to prove the quality of 
their algorithms. This often makes it difficult - if not impossible - to compare 
performance values of different algorithms. However, the suggested model does 
not provide a clearly defined benchmark but contains several topics that are still 
subject of ongoing research. In order to implement a simulation environment, we 
had to fill these gaps with what seemed to be the best possible solution at this 
time. In the following we give a formal description of the workload model and the 
criteria that were used for evaluating the metacomputer scheduling techniques. 



4.1 Job Arrivals 

As in [18] we use an open on-line model where jobs are continuously fed into 
the system while the scheduler is working. The rate at which new jobs enter the 
metacomputer is modeled by a polynomial of degree 8 as described in [7]. Thus, 
the arrival rate at any given time t is 

$$\lambda(t) = 3.1 - 8.5t + 24.7t^2 + 130.8t^3 + 107.7t^4 + 804.2t^5 + 2038.5t^6 + 1856.8t^7 + 4618.6t^8 \qquad (2)$$

with $-0.5 < t < 0.5$ representing the time span from 8:30 AM to 6:00 PM. 
During the remaining time, we estimate an arrival rate of one job per hour for 
normal load situations. To obtain scheduling performance as a function of load, 
we multiply this arrival rate by a compression factor that simulates different 
load situations. 

Obviously, this way of modeling job arrivals is derived from observations 
made on stand-alone supercomputers. On a nationwide or even global 
metacomputer, jobs will probably arrive in a more uniform distribution. 



4.2 Job Sizes 

Estimation of job parallelism is based on the suggestions made in [18] and [14]. 
This means that a distinction is made between jobs using a number of processors 
that is a power of two and the remaining ones. Both [18] and [14] observed that 
powers of two are requested much more often than other processor numbers. 
Hence, it was suggested to use a model where the probability of using fewer 
than n processors is roughly proportional to log n and modify it to contain extra 
steps where n is a power of two. The height of these steps is derived from the 
workload parameter $p_2$ that determines the fraction of jobs using partition sizes 
that are powers of 2. Feitelson and Rudolph observed values of $p_2$ around 81%, 
but since it is unclear whether this is going to decrease in the future, it was 
suggested to integrate it as a parameter rather than a constant. 
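
The job size model described above can be illustrated with a small sampler. The following is only a minimal sketch under our own assumptions, not the generator used for the simulations: the function name `sample_job_size`, the rejection loop for sizes that are not powers of two, and the uniform choice of the exponent are all hypothetical choices made for illustration. Only the parameter $p_2$ and the roughly logarithmic shape are taken from the text.

```python
import math
import random


def sample_job_size(max_procs, p2, rng):
    """Draw a processor count in [1, max_procs] (illustrative sketch).

    With probability p2 the result is a power of two; otherwise it is a
    log-uniformly distributed size that is not a power of two.
    """
    if max_procs < 3:
        raise ValueError("need max_procs >= 3 so that non-power-of-two sizes exist")
    max_exp = max_procs.bit_length() - 1  # largest k with 2**k <= max_procs
    if rng.random() < p2:
        # Uniform exponent gives a log-uniform distribution over powers of two.
        return 2 ** rng.randint(0, max_exp)
    while True:
        # Log-uniform draw over [1, max_procs], rounded to an integer.
        n = int(round(math.exp(rng.uniform(0.0, math.log(max_procs)))))
        n = min(max(n, 1), max_procs)
        if n & (n - 1) != 0:  # reject powers of two (including 1)
            return n
```

Keeping $p_2$ as an argument mirrors the suggestion to treat it as a parameter rather than a constant; a call such as `sample_job_size(128, 0.81, random.Random(0))` would use the value observed by Feitelson and Rudolph.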

The second dimension of a job’s size is its execution time. Results reported 
in [16] and [14] indicate that runtimes of jobs on large systems have a wide 
distribution with a high degree of variation. This may be the reason why there 
is currently no model available that leads to convincing results for all the ac- 
counting data that we have examined [21,17]. But there is evidence that the 
traces made available by some supercomputers show a mixture of two different 
job classes. On the one hand, there are short running jobs that have only a 
small degree of parallelism. These jobs are typically generated during the devel- 
opment cycle. On the other hand, there are long running production runs that 
require large amounts of resources. This effect cannot be seen equally well on 
all available traces and especially for a metacomputer we expect only a small 
number of development runs. Consequently we decided to follow the suggestion 
made in [18] and [14] for the simulation of batch production runs and assume 
uniform logarithmic distribution for the execution times. This means that for 
any given job requesting A processors the probability that its execution time on 
a monolithic machine lasts t units is 

P«) = T (3) 



4.3 Internal Job Structure 

The model of the internal job structure proposed in [18] is very generic and needs 
some refinement in order to be applicable to our scenario. In the metacomputing 
context the most important properties of a job are scalability and sensitivity 
towards slowdown of selected communication links. Preliminary surveys among 
users of the Metacomputer Online Initiative [35] and the Northrhine-Westphalian 
Metacomputer Task-force [36] indicate domination of two distinct job classes: 

1. rigid jobs with frequent barrier synchronization or grid structured commu- 
nication and 

2. malleable applications with only few communications that are programmed 
using a client/server paradigm. 

In the following, we will refer to class 1 as synchronous jobs and to jobs of class 2 
as asynchronous. The fraction $p_{sync}$ of synchronous jobs among all jobs is a 
parameter of the model. 

For our scheduler it is important to be able to calculate the expected run- 
time of a job according to the assigned amount of processors and the number 
of machines used. The latter is relevant because the number of machines deter- 
mines the number of slow WAN network connections a job has to cope with. 
Besides the fact that synchronous jobs imply a fixed degree of parallelism, the 
main difference between the two job classes is the dependency between avail- 
able communication performance and expected run time. Due to the frequent 
synchronizations performed by synchronous jobs, the runtime performance is 
determined by the speed of the slowest communication link. This means that, 
if a synchronous job is mapped onto multiple computers, the actual number of 
machines used has no effect on the expected execution time. If there is one wide 
area network link involved, it dominates the communication behavior of the 
application completely. Hence, the parallel execution time of such a request r that 
uses $c_i$ processors on each machine $i \in \{0, \ldots, M-1\}$ can be derived as 

$$\mathit{Time}_{sync}(r, (c_0, \ldots, c_{M-1})) := \frac{\mathit{Seq}(r)}{S_r\left(\sum_{i=0}^{M-1} c_i\right)} \cdot \left(1 + \mathit{Comm}(r) \cdot \left(\frac{BW_{mpp}}{BW_{wan}} - 1\right) \cdot \phi_{sync}\right)$$

with 

$$\phi_{sync} := \min\left(\mathrm{Card}(\{c \in \{c_0, \ldots, c_{M-1}\} \mid c > 0\}) - 1,\; 1\right) \qquad (4)$$



where $P_j$ denotes the number of processors available on machine j, $BW_{wan}$ 
represents the bandwidth of a wide area communication link, $BW_{mpp}$ the 
bandwidth within each of the parallel machines, and $S_r(n)$ the speedup for n 
processors. Seq(r) denotes the runtime on a single processor and Comm(r) the fraction 
of the overall execution time that is consumed by communication if the job is 
executed on a number of processors that corresponds to its average parallelism. 
Since it is unknown what the distribution of Comm(r) in a representative set of 
metacomputer jobs will look like, we decided to assume a uniform distribution 
within the range of 0 < Comm(r) < 0.5. 

As a consequence of (4), the scheduler should avoid splitting a synchronous 
job onto multiple machines wherever possible. However, if it has to be split, then 
the algorithm is free to choose an arbitrary number of different machines. 
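
The runtime model for synchronous jobs can be sketched in a few lines. This is an illustrative reading of (4) under stated assumptions: the function name `time_sync`, the argument names, and passing the speedup function as a callable are our own choices, and we count the machines that receive at least one processor, which is how we interpret the cardinality term.

```python
def time_sync(seq_time, comm_fraction, allocation, speedup, bw_ratio):
    """Expected runtime of a synchronous job (sketch of equation (4)).

    seq_time      -- Seq(r), runtime on a single processor
    comm_fraction -- Comm(r), fraction of time spent communicating
    allocation    -- processors c_i assigned on each machine
    speedup       -- callable n -> S_r(n)
    bw_ratio      -- BW_mpp / BW_wan
    """
    used = [c for c in allocation if c > 0]
    if not used:
        raise ValueError("the job must receive at least one processor")
    # phi is 0 on a single machine and 1 as soon as a WAN link is involved.
    phi = min(len(used) - 1, 1)
    parallel_time = seq_time / speedup(sum(used))
    return parallel_time * (1.0 + comm_fraction * (bw_ratio - 1.0) * phi)
```

With a hypothetical linear speedup and a bandwidth ratio of 25, splitting a job over two machines or over three machines yields the same runtime, which is the property that allows the scheduler to pick an arbitrary number of machines once a split is unavoidable.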

The effect of a partial communication slowdown on an asynchronous job is 
less dramatic than in the synchronous case. Here we assume a star-like 
communication on the WAN links. In the client/server paradigm this means that all 
the work packages are generated on one single machine from which they are 
distributed to the worker nodes. Arbitrary communication topologies can be used 
within each of the machines. Therefore, the parallel runtime of an asynchronous 
job is less sensitive towards a partitioning onto multiple computers. The 
slowdown in communication is proportional to the amount of data that has to be sent 
across wide area links. Assuming that this is again proportional to the number of 
processors on the other side of the link, the expected execution time can be obtained 
from 

So far we have not explained how the speedup function $S_r(n)$ is defined. 
Our solution follows the suggestions made by Downey in [15]. His model 
compares well with real life workload traces but incorporates only few free 
parameters. Downey characterizes the speedup behavior of a program by its average 
parallelism A and the variance in parallelism $\sigma$ with 



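$$S_r(n) = \begin{cases}
\dfrac{An}{A + \sigma(n-1)/2} & 0 \le \sigma \le 1,\ 1 \le n \le A \\
\dfrac{An}{\sigma(A - 1/2) + n(1 - \sigma/2)} & 0 \le \sigma \le 1,\ A \le n \le 2A - 1 \\
\dfrac{nA(\sigma + 1)}{A + A\sigma - \sigma + n\sigma} & \sigma \ge 1,\ 1 \le n \le A + A\sigma - \sigma \\
A & \text{else}
\end{cases} \qquad (6)$$

Typical values for $\sigma$ are in the range of $0 \le \sigma \le 2$. Although no extensive study 
has been made on the distribution of $\sigma$ in real workloads, Downey suggests in 
[14] to use a uniform distribution. 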
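
For illustration, Downey's speedup function can be written down directly from its case distinction. The sketch below is not taken from our simulator; the function name `downey_speedup` and its argument names are hypothetical, and the input checks are our own addition.

```python
def downey_speedup(n, A, sigma):
    """Speedup on n processors for average parallelism A and variance sigma."""
    if n < 1 or A < 1 or sigma < 0:
        raise ValueError("expected n >= 1, A >= 1 and sigma >= 0")
    if sigma <= 1.0:
        # Low-variance model.
        if n <= A:
            return A * n / (A + sigma * (n - 1) / 2.0)
        if n <= 2 * A - 1:
            return A * n / (sigma * (A - 0.5) + n * (1.0 - sigma / 2.0))
        return float(A)
    # High-variance model.
    if n <= A + A * sigma - sigma:
        return n * A * (sigma + 1.0) / (A + A * sigma - sigma + n * sigma)
    return float(A)
```

Two properties are easy to check by hand: the speedup on one processor is 1 in both models, and the speedup never exceeds the average parallelism A once enough processors are available.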



4.4 Evaluation Criteria 

Although much research has been done on scheduling, there is still no final 
answer to the question of which metrics should be used for measuring the quality 
of a schedule [18,19]. This is because different people have different requirements. 
For example, managers of computing centers have a strong interest in optimizing 
the overall throughput of their machines. The users of these centers, however, 
like to obtain good response times for their jobs. Since managers of computing 
centers should also be interested in satisfying the needs of their users, it might be 
a good idea to search for optimal response time schedules with good throughput. 
However, response time measurement can lead to misleading results since it tends 
to overemphasize the effects of small jobs. Nevertheless, we decided to focus 
primarily on optimizing response time since it is used in many papers dealing 
with on-line scheduling and it is at least considered a sensible metric in [18]. 



5 The Algorithms 

An analogy that is often used for describing scheduling algorithms represents 
machines as two dimensional buckets. The width of such a bucket is related to 
the total amount of resources on the corresponding machine and the height 
represents time (see Fig. 3). Each request that has been scheduled for a single 
machine can therefore be represented by a rectangle defined by the number of 
processors allocated to that job (width of the rectangle) and the expected run 
time (height of the rectangle). For the rest of this paper we will use $Free_m(t)$ 
as the amount of free resources on machine m at time t. Furthermore, we define 
$Free^*_m(t)$ as 

$$Free^*_m(t) := \min\{Free_m(t') \mid t' \ge t\} \qquad (7)$$

Fig. 3. Job schedule on a parallel machine and the corresponding surface 

$Free^*_m$ will also be called the surface of m. In Fig. 3 the surface corresponds to 
the dashed line. 

The same analogy can be used for the metacomputer scenario (see Fig. 4). 
For this case we define the surface of the whole machine pool M as $Free^*_M(t)$ 
with 

$$Free^*_M(t) := \sum_{m \in M} Free^*_m(t) \qquad (8)$$
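
The two surface definitions can be made concrete with a short sketch. It assumes, purely for illustration, that time is discretized into equal steps over a finite horizon and that the free capacity of every machine is given as a list of the same length; the function names `surface` and `pool_surface` are hypothetical.

```python
def surface(free):
    """Surface of one machine: minimum free capacity from step t onwards.

    free[t] is the number of free processors at time step t.
    """
    result = [0] * len(free)
    running_min = float("inf")
    # Walk backwards so that each entry sees the minimum of its suffix.
    for t in range(len(free) - 1, -1, -1):
        running_min = min(running_min, free[t])
        result[t] = running_min
    return result


def pool_surface(free_by_machine):
    """Surface of the machine pool: sum of the machine surfaces per step."""
    surfaces = [surface(free) for free in free_by_machine]
    return [sum(column) for column in zip(*surfaces)]
```

For a machine with free capacities `[2, 0, 3, 1, 4]` the surface is `[0, 0, 1, 1, 4]`: capacity that is free now but occupied later does not count, which is what makes the surface a safe basis for placing new requests.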



Fig. 4. Constructing the metacomputer’s surface from machine surfaces 

5.1 Scheduling with Full Control - The Idealized Scenario 

A scenario that offers full control to the metacomputing scheduler can be 
achieved by denying direct submission of jobs to the local queuing systems. 
As a consequence, the scheduler has complete knowledge about all jobs in the 
system, which makes metacomputer scheduling in this scenario similar to 
scheduling a single parallel machine. If the speed of the Internet connections between 
the machines were the same as that of internal communication, the problem would be 
identical to the job management of a single parallel machine. However, Internet 
communication is much slower and the algorithm has to take this into account. 
As explained in Sec. 2.2 we have chosen scheduling algorithms based on FCFS, 
FFIH, and ASP for further investigation. 

FCFS First Come First Serve is a very simple approach. When a request for 
r resources is submitted to the system at time s, it will be configured at time t 
with $t = \min\{t' \ge s \mid Free^*_M(t') \ge r\}$. The choice of which (and how many) 
machines shall be used is based upon a greedy strategy. If the request represents 
an asynchronous application, it is divided into as few parts as possible in order 
to decrease its overall execution time. If the program to be launched belongs 
to the synchronous class, the algorithm tries to fit the request into one single 
machine. If this is not possible, it is divided into as many parts as possible, 
thereby using up all those small partitions that would otherwise slow down 
subsequent asynchronous (or probably also synchronous) applications. This is a 
feasible approach since the communication costs of synchronous jobs depend on 
the speed of the slowest network connection. 

For the metacomputing scenario, we derived a variant called FCFS* that 
does not choose t as the earliest possible starting time but as the time that 
assures the earliest possible finishing time. This differs from the original 
algorithm, because in the metacomputing scenario a later starting time can lead 
to an earlier result, since the job could possibly be mapped onto fewer machines (see 
equations (4) and (5)). 

Although FCFS is a very simple algorithm, experience has shown that its 
results are usually not as bad as one could have expected [39]. 
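
The FCFS start rule and the greedy choice of machines can be sketched as follows. This is a simplified illustration, not our implementation: it assumes a discretized pool surface, a dictionary of currently free processors per machine, and a tightest-fit rule for synchronous jobs that fit on one machine. The names `fcfs_start` and `greedy_partition` are hypothetical.

```python
def fcfs_start(pool_surface, submit, r):
    """Earliest step t >= submit at which the pool surface offers r processors."""
    for t in range(submit, len(pool_surface)):
        if pool_surface[t] >= r:
            return t
    return None  # no feasible start within the known horizon


def greedy_partition(free_per_machine, r, synchronous):
    """Split a request for r processors over machines; returns {machine: count}."""
    candidates = {m: f for m, f in free_per_machine.items() if f > 0}
    if sum(candidates.values()) < r:
        raise ValueError("not enough free processors in the pool")
    if synchronous:
        fitting = [m for m, f in candidates.items() if f >= r]
        if fitting:
            # Assumption: prefer the tightest fit to keep large machines free.
            best = min(fitting, key=lambda m: candidates[m])
            return {best: r}
        # Smallest machines first: as many parts as possible.
        order = sorted(candidates, key=lambda m: candidates[m])
    else:
        # Largest machines first: as few parts as possible.
        order = sorted(candidates, key=lambda m: candidates[m], reverse=True)
    allocation, remaining = {}, r
    for m in order:
        take = min(candidates[m], remaining)
        allocation[m] = take
        remaining -= take
        if remaining == 0:
            break
    return allocation
```

With free capacities of 8, 2, and 4 processors, an asynchronous request for 10 processors is placed on two machines, whereas a synchronous request of the same size consumes the two small partitions first and spreads over all three.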

FFIH First Fit Increasing Height is a two-dimensional variant of the well known 
List Scheduling algorithm [34]. The basic idea of this approach is to sort all 
requests in order of increasing resource demands and then perform an FCFS 
scheduling with the sorted list. Strictly speaking, FFIH is an offline algorithm. 
However, if a waiting room is introduced [21], it can be used for online scheduling, 
too. Since FFIH is derived from FCFS, it shows similar behavior. However, it 
is typically better for minimizing the average response time, since small jobs 
are scheduled first and therefore the number of jobs with long waiting times is 
reduced. 

ASP Adaptive Static Partitioning is one of the most promising disciplines for 
scheduling moldable jobs on parallel machines. Although up to now it has only 
rarely been used in practice, the algorithm is simple, easy to implement, 
and shows good performance on monolithic machines if applied to real 
workloads [29]. ASP stores incoming requests in a waiting room until there are free 
processors available. It then assigns as many requests to these processors as 
possible. If the waiting room is sufficiently filled, this means that asynchronous (i.e. 
moldable) jobs may be scaled down to only a few processors in order to be able 
to start more jobs at the same time. For jobs with a high degree of parallelism 
this can lead to extremely long execution times. Hence, we introduced a slight 
modification of the original algorithm which we call ASP*. This strategy assigns 
at least $\alpha A$ processors to any job for which A processors were requested, where 
$\alpha$ is a constant of the algorithm with $0 < \alpha < 1$. 
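
The lower bound introduced by ASP* can be illustrated with a simplified allocation round. The sketch below rests on several assumptions that go beyond the text: all waiting jobs are treated as moldable, free processors are shared equally among the jobs still waiting, jobs are served in arrival order, and the round stops as soon as the next job cannot receive its minimum share. The function name `asp_star_allocate` is hypothetical.

```python
import math


def asp_star_allocate(free, requested, alpha):
    """One allocation round; returns a list of (job index, processors)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    started = []
    waiting = list(enumerate(requested))  # (index, requested processors A)
    while waiting and free > 0:
        # Equal share of the remaining processors among the waiting jobs.
        share = max(free // len(waiting), 1)
        index, wanted = waiting.pop(0)
        # Never exceed the request, never fall below alpha * A.
        procs = min(wanted, max(share, math.ceil(alpha * wanted)))
        if procs > free:
            break  # the minimum share does not fit; keep arrival order
        started.append((index, procs))
        free -= procs
    return started
```

With 16 free processors and four waiting requests for 16 processors each, a very small $\alpha$ starts all four jobs on 4 processors, while $\alpha = 0.5$ starts only two jobs on 8 processors each. This shows how larger values of $\alpha$ move the behavior towards FCFS.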



5.2 Scheduling with No Control - The Reference Case 

Scheduling with no control models the way job management is currently done by 
most metacomputing environments. Local jobs are submitted directly to the ma- 
chines without notifying the metacomputer scheduler. Consequently, the sched- 
uler mainly concentrates on good mapping and partitioning strategies. Further- 
more, heuristics can be integrated to estimate the total workload on each machine 
of the metacomputer. In the following, we describe how the scheduling strategies 
were modified for this scenario. 

Since FFIH, ASP, and ASP* need information about the contents of the 
local queues (the waiting room is only drained when the local queues are at 
least partially empty), these algorithms are not directly applicable in a scenario 
where no information is available. Hence, we chose only the FCFS strategy for 
examination in such an environment. FCFS with no control over local queues 
first stores all n machines in random order in a vector $(m_0, \ldots, m_n)$. Then each 
request $r_i$ in the waiting queue is mapped onto machines $m_{i_1}, \ldots, m_{i_2}$ with 
$\sum_{l=i_1}^{i_2} Size(m_l) \ge A_{r_i}$, whereby $Size(m_l)$ gives the total number of processors 
of machine l and $A_{r_i}$ gives the overall number of resources requested by $r_i$. 
For any two jobs $r_i, r_j$ with i < j the mapping is created in a way such that 
$i_1 \le i_2 \le j_1 \le j_2$. 

Randomizing the machine vector for each scheduling round works as a simple 
heuristic to achieve balanced work distribution over the complete machine pool. 



5.3 Scheduling with Minimum Control - The Compromise 

Scheduling with minimum control is the compromise we propose as a feasible way 
to adapt scheduling algorithms designed for monolithic parallel machines to a 
metacomputing environment. The fundamental idea of this approach is to use the 
surfaces of all underlying machines for scheduling the metacomputer requests. 
Local jobs can still be submitted to the queuing systems directly. However, 
from time to time the metacomputer performs a scheduling round. When this 
happens, each machine informs the metacomputer about the current shape of 
its surface (or possibly only a subset of the complete surface, depending on 
local administration policies). During the short period while the metacomputer 
performs its scheduling, the local queuing systems must not alter the shape of 
their surfaces. This can, for example, be achieved by holding jobs that arrive 
during this time in a special waiting room. 

This solution offers the freedom of the full control scenario during each 
scheduling round without imposing significant restrictions onto the local 
queuing systems. It is therefore a) feasible and b) a much better basis for advanced 
scheduling algorithms such as those described in [29]. We call this 
technique Interleaved Hybrid Scheduling or IHS for short, because each machine is 
alternately managed by two different schedulers (local and meta scheduler). A 
drawback of IHS is that the surfaces presented to the metacomputer scheduling 
algorithm are usually not flat but rather form a more or less steep staircase. 
Hence, many of the algorithms found in the literature cannot be applied directly. 

The two possible solutions are either to schedule on top of a flat line placed 
above the highest step of the surface or to find an algorithm that can efficiently 
handle uneven surfaces. So far, we have only used the IHS technique with the 
algorithms described in Sec. 5.1. In the future, we are planning to search for 
scheduling strategies that are better tuned to the IHS scenario. 

There exist several possible schemes for deciding when the metacomputer 
shall trigger a new scheduling round. For this paper we have examined two 
different approaches. The first solution was to wait a fixed time interval $\Delta t$ 
between the termination of the last metacomputing job of the previous round 
and the invocation of a new one. This solution is attractive for the machine 
owners, since it guarantees a fixed time interval during which their resources are 
not used by the metacomputer. For the results presented in Sec. 6 we used a 
value of $\Delta t = 4$ hours. 

The second and more flexible approach is to also consider the amount of 
resource requests that are pending in the waiting room of the metacomputer. 
In order to achieve this, we triggered a scheduling round of the metacomputer 
when 



— — 7 — 7- • Ar ■ Time(r,Ar) ■ Waited (r) ■ 7 

zp ( m I 

m^MachinePool V / r^WaitingRoom 

( 9 ) 

Where Time {r, Ar) is the expected execution time of r on Ar processors of a 
monolithic machine. Waited {r) is the amount of time units the request has so 
far been waiting, and 7 is a constant weight factor that determines, how much 
the overall decision is influenced by the waiting time. 



6 Results 

The following results were all obtained from a wide range of simulation runs 
using the model described in Sections 3 and 4. We measured the accumulated 
response time of the first $c \cdot M$ jobs that were completed by the simulated 
metacomputer, where M was the total number of machines in the pool and c 
was set to 1000. As can be seen in Figures 5, 6, and 7, c = 1000 is not completely 
sufficient to eliminate all the effects of a randomized workload. Although these 
simulations were performed on the large parallel machines of the Paderborn 
Center for Parallel Computing, it took several weeks to collect the necessary 
data. Therefore, we believe that c = 1000 represents a good compromise between 
precision of the data and resources needed to obtain the results. 

It should be pointed out that jobs are arriving throughout the whole simula- 
tion run. This means that the queues are not being drained towards the end of 
simulation which would be an unrealistic scenario. However, as we will point out 
later, this does not leave the results unaffected and has to be considered, when 
the simulation data is being interpreted. 



Fig. 5. Performance of different algorithms in the idealized scenario (plot title: 32 machines, p_l=0.5, MppWanRatio=25.0, p_sync=0.5) 



Fig. 5 depicts the simulation results for algorithms running in the idealized 
scenario that does not allow direct submission of jobs to local queues (full con- 
trol). In this diagram compression ranges up to a value of 19, meaning that within 
a certain time interval there are 19 times more job arrivals than normal. The 
metacomputer contained 32 machines, Internet communication was assumed to 
be 25 times slower than communication within the parallel machines, the ratio 
between synchronous and asynchronous workload was 1:1, and there were as 
many local jobs as jobs submitted through the metacomputer. 

First of all it can be observed that all curves converge as the load 
increases. This is due to the fact that the job queues were not drained 
towards the end of a simulation run. Thus, if the arrival rate reaches a certain 
algorithm dependent threshold, a large fraction of the jobs remains in the queues 
and has no or only marginal effect on the performance of the algorithm. 

Furthermore, it is remarkable that the ASP-based algorithms show the 
weakest performance. This is due to the fact that ASP has a tendency to allocate very 
few processors even to large jobs. This effect is somewhat reduced by the ASP* 
variant. However, since ASP* becomes identical to FCFS if $\alpha$ gets close to 1, 
we have chosen $\alpha = 0.5$. This was not large enough to prevent the algorithm 
from creating extremely long running jobs and thereby preventing the start of 
large synchronous (i.e. rigid) jobs for a long time. We suppose that ASP would 
perform better if the job mix contained only asynchronous jobs. 

FFIH, on the other hand, demonstrates good results in our simulation envi- 
ronment. However, the plots indicate that this strategy is very sensitive towards 
the job sequence to be scheduled. The reason for this becomes clear, if we imag- 
ine the worst possible case. This is an alternating sequence of jobs that either 
use few processors and run for a long time or jobs that block the whole meta- 
computer but can be finished almost instantly. Since these requests have similar 
resource demands such sequences are likely to be created by the sorting step of 
FFIH. 



Fig. 6. Performance of different algorithms in a real world scenario (plot title: 32 machines, p_l=0.5, MppWanRatio=25.0, p_sync=0.5) 

The results presented in Fig. 6 indicate what can be achieved in a real world 
scenario if the IHS technique is applied. It can be seen that for smaller and 
therefore more realistic compression values all IHS variants show significantly 
better performance than an FCFS algorithm that uses no information about the 
local queues. For these load values the FFIH and FCFS strategies even seem to be 
better than they are in the scenario with full control over the local queues. This 
is caused by the fact that in the idealized scenario there are no locally submitted 
jobs, which are restricted to the size of the corresponding machines and therefore 
are smaller than many of the metacomputing jobs. As a consequence, care has 
to be taken when comparing the plots of Figures 5 and 6. We think that the 
slight difference in the incoming job mix corresponds to the users’ behavior, but 
this assumption still has to be proven. 

At first glance, it is astonishing that the ASP algorithm seems to be the 
best strategy for the real world scenario. A closer look at Fig. 7 shows 
that the result plots for ASP in Fig. 6 are mainly dominated by the response 
times of those jobs submitted by local users. In other words, the execution times 
of metacomputing jobs became so long that only few of them terminated among 
the first $c \cdot M$ jobs. The good results of ASP have thus been achieved by 
providing a much worse quality of service for multi-site applications. Hence, for 
scheduling a metacomputer we propose to use IHS either with FCFS* or with 
FFIH. The latter is best for moderate workloads while FCFS tends to behave 
slightly better when the system is extremely heavily loaded. 



7 Conclusion 

Our motivation was - and still is - to find the best possible scheduling algorithm 
for the NRW metacomputer. Looking at what was available in the literature 
exposed a large gap between algorithms that have been studied analytically 
and those found in existing environments. Therefore, we extended the model of 
Feitelson and Rudolph towards metacomputer scheduling in order to obtain a 
tool for measuring the quality of different algorithms. Care was taken to derive a 
model that reflects the real world scenario as closely as possible. Only as a 
secondary goal did we try to keep it simple enough for analytical studies. We think that our 
model is still close enough to the proposal of Feitelson and Rudolph to allow 
comparison with other results obtained from the same approach. Much effort has 
been spent on the implementation of this model. Therefore, we tried to create 
a generic simulation environment, which we are making publicly available. This 
may also be useful to researchers who study scheduling of monolithic parallel 
machines, since a metacomputer can be seen as a generalization of this concept. 

We found that an important factor for metacomputer scheduling is the 
existence of local queues to which jobs are submitted without being passed through 
the metacomputer system. Hence, we developed the IHS technique as a feasible 
approach to use well known scheduling algorithms for the prototype of a 
working metacomputer. Our results indicate that for moderate workloads use of the 
IHS technique with the FFIH algorithm decreases the average response times 
significantly. However, more effort should be spent on examining more powerful 
algorithms with IHS. 

Fig. 7. Average response time of jobs that were submitted locally (plot title: 32 machines, p_l=0.5, MppWanRatio=25.0, p_sync=0.5) 

Another important aspect of metacomputer scheduling that was not dealt 
with in this paper is the reliability of job descriptions. So far we assumed that 
everything the scheduler is being told about a job is absolutely precise. However, 
in reality this is usually not the case. Hence, in our future work, we plan to pay 
closer attention to the effects of unreliable information on scheduling algorithms 
for metacomputing. 

Abstract. On many computers, a request to run a job is not serviced 
immediately but instead is placed in a queue and serviced only when 
resources are released by preceding jobs. In this paper, we build on 
run-time prediction techniques that we developed in previous research to 
explore two problems. The first problem is to predict how long 
applications will wait in a queue until they receive resources. We develop 
run-time estimates that result in more accurate wait-time predictions than 
other run-time prediction techniques. The second problem we 
investigate is improving scheduling performance. We use run-time predictions 
to improve the performance of the least-work-first and backfill scheduling 
algorithms. We find that using our run-time predictor results in lower 
mean wait times for the workloads with higher offered loads and for the 
backfill scheduling algorithm. 



1 Introduction 

On many high-performance computers, a request to execute an application is not 
serviced immediately but instead is placed in a queue and serviced only when 
resources are released by running applications. We examine two separate prob- 
lems in this environment. First, we predict how long applications will wait until 
they execute. These estimates of queue wait times are useful to guide resource 
selection when several systems are available [7], to co-allocate resources from 
multiple systems [2], to schedule other activities, and so forth. Our technique 
for predicting queue wait times is to use predictions of application execution 
times along with the scheduling algorithms to simulate the actions made by a 
scheduler and determine when applications will begin to execute. 

We performed queue wait-time prediction and scheduling experiments using 
four workloads and three scheduling algorithms. The workloads were recorded 
from an IBM SP at Argonne National Laboratory, an IBM SP at the Cornell 
Theory Center, and an Intel Paragon at the San Diego Supercomputing Center. 
The scheduling algorithms are first-come first-served (FCFS), least-work-first 
(LWF), and backfill. We find that there is a built-in error of 34 to 43 percent 
when predicting wait times of the LWF algorithm and a smaller built-in error of 
3 to 4 percent for the backfill algorithm. We also find that more accurate 
run-time predictions result in more accurate wait-time predictions. Specifically, using 
our run-time prediction technique instead of maximum run times or the run-time 
prediction techniques of Gibbons [8] or Downey [3] improves run-time prediction 
error by 39 to 92 percent and this improves wait-time prediction performance 
by 16 to 89 percent. 

Second, we improve the performance of the LWF and backfill scheduling 
algorithms by using our run-time predictions. These algorithms use run-time 
predictions when making scheduling decisions, and we therefore expect that more 
accurate run-time predictions will improve scheduling performance. Using our 
run-time predictions with the same four workloads described above, we find that 
the accuracy of the run-time predictions has a minimal effect on the utilization 
of the systems we are simulating. We also find that using our run-time predictors 
results in mean wait times that are within 22 percent of the mean wait times that 
are obtained if the scheduler knows the exact run times of all of the 
applications. When comparing the different predictors, our run-time predictor results 
in 2 to 67 percent smaller mean wait times for the workload with the highest 
offered load. No prediction technique clearly outperforms the other techniques 
when the offered load is low. 

The next section summarizes our approach to predicting application run 
times and the approaches of other researchers. Section 3 describes our queue 
wait-time prediction technique and presents performance data. Section 4 presents 
the results of using our run-time predictions in the LWF and backfill scheduling 
algorithms. Section 5 presents our conclusions. 

2 Predicting Application Run Times 

This section briefly describes our prediction technique, discusses the scheduling 
algorithms used with this work, and reviews the run-time prediction techniques 
of two other researchers, Gibbons [8] and Downey [3]. For further details, see [12]. 

2.1 Our Run-Time Prediction Technique 

Our general approach to predicting application run times is to derive run-time 
predictions from historical information of previous “similar” runs. This approach 
is based on the observation [12,3,5,8] that similar applications are more likely 
to have similar run times than applications that have nothing in common. We 
address the issues of how to define similar and how to generate predictions from 
similar past applications. 

Defining Similarity A difficulty in developing prediction techniques based on 
similarity is that two jobs can be compared in many ways. For example, we can 
compare the application name, submitting user name, executable arguments, 
submission time, and number of nodes requested. In this work, we are restricted 
to those values recorded in workload traces obtained from various supercom- 
puter centers. However, because the techniques that we propose are based on 
the automatic discovery of efficient similarity criteria, we believe that they will 
apply even if quite different information is available. 



Table 1. Characteristics of the trace data used in our studies. 

Workload Name   System          Number of Nodes   Location   Number of Requests   Mean Run Time (minutes)
ANL (a)         IBM SP2         120               ANL        7994                 97.75
CTC             IBM SP2         512               CTC        13217                171.14
SDSC95          Intel Paragon   400               SDSC       22885                108.21
SDSC96          Intel Paragon   400               SDSC       22337                166.98



Table 2. Characteristics recorded in workloads. The column “Abbr” indicates 
abbreviations used in subsequent discussion. A “-” marks a characteristic that is 
not recorded in a trace. 

No.   Abbr   Characteristic       Argonne              Cornell                  SDSC
1     t      Type                 batch, interactive   serial, parallel, pvm3   -
2     q      Queue                -                    -                        29 to 35 queues
3     c      Class                -                    DSI/PIOFS                -
4     u      User                 Y                    Y                        Y
5     s      Loadleveler script   -                    Y                        -
6     e      Executable           Y                    -                        -
7     a      Arguments            Y                    -                        -
8     na     Network adaptor      -                    Y                        -
9     n      Number of nodes      Y                    Y                        Y
10           Maximum run time     Y                    Y                        -
11           Submission time      Y                    Y                        Y
12           Start time           Y                    Y                        Y
13           Run time             Y                    Y                        Y



(a) Note to Table 1: Because of an error when the trace was recorded, the ANL trace does not include 
one-third of the requests actually made to the system. To compensate, we reduced 
the number of nodes on the machine from 120 to 80 when performing simulations. 
The workload traces that we consider are described in Table 1; they originate 
from Argonne National Laboratory (ANL), the Cornell Theory Center (CTC), 
and the San Diego Supercomputer Center (SDSC). Table 2 summarizes the in- 
formation provided in these traces. Text in a field indicates that a particular 
trace contains the information in question; in the case of “Type,” “Queue,” or 
“Class” the text specifies the categories in question. 

The general approach to defining similarity that we as well as Downey and 
Gibbons take is to use characteristics such as those presented in Table 2 to 
define templates that identify a set of categories to which jobs can be assigned. 
For example, the template (q,u) specifies that jobs are to be partitioned by 
queue and user; on the SDSC Paragon, this template generates categories such 
as (q16m,wsmith), (q64l,wsmith), and (q16m,foster). 

We find that using characteristics 1-6 of Table 2 in the manner just described 
works reasonably well. On the other hand, the number of nodes is an essentially 
continuous parameter, and so we prefer to introduce an additional parameter 
into our templates, namely, a “node range size” that defines what ranges of 
requested number of nodes are used to decide whether applications are similar. 
For example, the template (u, n=4) specifies a node range size of 4 and generates 
categories (wsmith, 1-4 nodes) and (wsmith, 5-8 nodes). 

Once a set of templates has been defined (using a search process described 
later), we can categorize a set of applications (e.g., the workloads of Table 1) 
by assigning each application to those categories that match its characteristics. 
Categories need not be disjoint, and hence the same job can occur in several 
categories. If two jobs fall into the same category, they are judged similar; those 
that do not coincide in any category are judged dissimilar. 
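
The mapping from templates to categories can be illustrated with a short Python sketch. It is not the authors' implementation: the dictionary-based job records, the helper names `node_range`, `category`, and `similar`, and the way a node range size is attached to a template are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# A template is a tuple of characteristic abbreviations plus an optional
# node range size, e.g. (("q", "u"), None) or (("u",), 4). This encoding
# is hypothetical; the paper does not prescribe one.
Template = Tuple[Tuple[str, ...], Optional[int]]


def node_range(nodes: int, range_size: int) -> str:
    """Label of the fixed-size node range that contains `nodes`."""
    low = ((nodes - 1) // range_size) * range_size + 1
    return f"{low}-{low + range_size - 1} nodes"


def category(job: Dict[str, object], template: Template) -> tuple:
    """Category that `job` falls into under one template."""
    characteristics, range_size = template
    key = tuple(job[c] for c in characteristics)
    if range_size is not None:
        key += (node_range(int(job["n"]), range_size),)
    return key


def similar(job_a: Dict[str, object], job_b: Dict[str, object],
            templates: List[Template]) -> bool:
    """Two jobs are similar if at least one template puts them in the same category."""
    return any(category(job_a, t) == category(job_b, t) for t in templates)


job = {"q": "q16m", "u": "wsmith", "n": 6}
print(category(job, (("q", "u"), None)))  # ('q16m', 'wsmith')
print(category(job, (("u",), 4)))         # ('wsmith', '5-8 nodes')
```

Because a job is matched against every template, it can land in several categories at once, which is the non-disjointness noted above.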



Generating Predictions We now consider the question of how we generate 
run-time predictions. The input to this process is a set of templates and a work- 
load for which run-time predictions are required. In addition to the character- 
istics described in the preceding section, running time, maximum history, type 
of data to store, and prediction type are also defined for each template. The 
running time is how long an application has been running when a prediction is 
made; we use this characteristic by forming a prediction from a category using 
only the data points that have an execution time less than this running time. 
The maximum history indicates the maximum number of data points to store in 
each category generated from a template. The type of data is either an actual run 
time or a relative run time. A relative run time incorporates information about 
user-supplied run time estimates by storing the ratio of the actual run time to 
the user-supplied estimate (the maximum run times provided by the ANL and 
CTC workloads). The prediction type determines how a run-time prediction is 
made from the data in each category generated from a template. In our previous 
work, we considered four prediction types: a mean, a linear regression, an inverse 
regression, and a logarithmic regression [13,4]. We found that the mean is the 
single best predictor, so this study uses only the mean to form predictions. 
The output from this process is a set of run-time predictions and associated 
confidence intervals. (A confidence interval is an interval centered on the run- 
time prediction within which the actual run time is expected to appear some 
specified percentage of the time.) The basic algorithm is described below and 
comprises three steps: initialization, prediction, and incorporation of historical 
information. 

1. Define $T$, the set of templates to be used, and initialize $C$, the (initially 
empty) set of categories. 

2. At the time each application $a$ begins to execute: 

(a) Apply the templates in $T$ to the characteristics of $a$ to identify the 
categories $C_a$ into which the application may fall. 

(b) Eliminate from $C_a$ all categories that are not in $C$ or that cannot provide 
a valid prediction (i.e., do not have enough data points). 

(c) For each category remaining in $C_a$, compute a run-time estimate and a 
confidence interval for the estimate. 

(d) If $C_a$ is not empty, select the estimate with the smallest confidence 
interval as the run-time prediction for the application. 

3. At the time each application $a$ completes execution: 

(a) Identify the set $C_a$ of categories into which the application falls. These 
categories may or may not exist in $C$. 

(b) For each category $c_i \in C_a$: 

i. If $c_i \notin C$, create $c_i$ in $C$. 

ii. If $|c_i|$ = maximum history($c_i$), remove the oldest point in $c_i$. 

iii. Insert $a$ into $c_i$. 

Note that steps 2 and 3 operate asynchronously, since historical information 
for a job cannot be incorporated until the job finishes. Hence, our algorithm 
suffers from an initial ramp-up during which there is insufficient information in 
C to make predictions. This deficiency could be corrected by using a training 
set to initialize C. 

The use of maximum histories in step 3(b) of our algorithm allows us to 
control the amount of historical information used when making predictions and 
the amount of storage space needed to store historical information. A small 
maximum history means that less historical information is stored, and hence 
only more recent events are used to make predictions. 
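
The three steps (initialization, prediction, and incorporation of history) can be sketched in Python as follows. This is a minimal illustration rather than the authors' code. The class name `Predictor`, the dictionary job records, and in particular the confidence interval (a normal-approximation half-width of z times the standard error of the mean) are assumptions; the text does not specify how the interval is computed.

```python
import math
from collections import deque
from statistics import mean, stdev


class Predictor:
    """Mean-based run-time predictor over template-defined categories."""

    def __init__(self, templates, max_history=None, min_points=2, z=1.645):
        self.templates = templates           # T: tuples of characteristic names
        self.categories = {}                 # C: category key -> recent run times
        self.max_history = max_history       # None means unlimited history
        self.min_points = max(2, min_points) # stdev needs at least two points
        self.z = z                           # assumed interval multiplier

    def _keys(self, job):
        # The template is part of the key so that categories produced by
        # different templates never collide.
        return [(t, tuple(job[c] for c in t)) for t in self.templates]

    def predict(self, job):
        """Step 2: return (estimate, half_width) or None if no category is valid."""
        best = None
        for key in self._keys(job):
            points = self.categories.get(key)
            if points is None or len(points) < self.min_points:
                continue
            estimate = mean(points)
            half_width = self.z * stdev(points) / math.sqrt(len(points))
            if best is None or half_width < best[1]:
                best = (estimate, half_width)
        return best

    def insert(self, job, run_time):
        """Step 3: record a completed job in every category it falls into."""
        for key in self._keys(job):
            # deque(maxlen=k) silently drops the oldest point when full.
            points = self.categories.setdefault(
                key, deque(maxlen=self.max_history))
            points.append(run_time)
```

A call to `predict` before any matching job has completed returns `None`, which corresponds to the ramp-up period described above.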



Template Definition and Search We use search techniques to identify good 
templates for a particular workload; this approach is in contrast to the strategies 
of Gibbons and Downey, who use a fixed set of templates. While the number of 
application characteristics included in our traces is relatively small, the fact that 
effective template sets may contain many templates means that an exhaustive 
search is impractical. Our previous work compared greedy and genetic algorithm 
searches and found that genetic algorithm searches outperform greedy searches. 
Therefore, we use only genetic algorithm searches in this work. 
Genetic algorithms are a probabilistic technique for exploring large search 
spaces, in which the concept of cross-over from biology is used to improve effi- 
ciency relative to purely random search [10]. A genetic algorithm evolves indi- 
viduals over a series of generations. The process for each generation consists of 
evaluating the fitness of each individual in the population, selecting which indi- 
viduals will be mated to produce the next generation, mating the individuals, 
and mutating the resulting individuals to produce the next generation. The pro- 
cess then repeats until a stopping condition is met. The stopping condition we 
use is that a fixed number of generations have been processed. There are many 
different variations to this process, and we will next describe the variations we 
used. 

Our individuals represent template sets. Each template set consists of be- 
tween 1 and 10 templates, and we encode the following information in binary 
form for each template: 

1. Whether a mean or one of the three regressions is used to produce a prediction. 

2. Whether absolute or relative run times are used. 

3. Whether each of the binary characteristics associated with the workload in 
question is enabled. 

4. Whether node information should be used and, if so, the range size from 1 
to 512 in powers of 2. 

5. Whether the amount of history stored in each category should be limited 
and, if so, the limit between 2 and 65536 in powers of 2. 



A fitness function is used to compute the fitness of each individual and there- 
fore its chance to reproduce. The fitness function should be selected so that the 
most desirable individuals have higher fitness and produce more offspring, but 
the diversity of the population must be maintained by not giving the best in- 
dividuals overwhelming representation in succeeding generations. In our genetic 
algorithm, we wish to minimize the prediction error and maintain a range of 
individual fitnesses regardless of whether the range in errors is large or small. 
The fitness function we use to accomplish this goal is 



$$F_{\min} + \frac{E_{\max} - E}{E_{\max} - E_{\min}} \times (F_{\max} - F_{\min})$$

where $E$ is the error of the individual (template set), $E_{\min}$ and $E_{\max}$ are 
the minimum and maximum errors of individuals in the generation, and $F_{\min}$ 
and $F_{\max}$ are the desired minimum and maximum fitnesses. We choose 
$F_{\max} = 4 F_{\min}$. 



We use a common technique called stochastic sampling with replacement 
to select which individuals will mate to produce the next generation. In this 
technique, each parent is selected from the individuals by selecting individual $i$ 
with probability $F_i / \sum_j F_j$, where $F_i$ is the fitness of individual $i$. 
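
As a rough illustration of fitness scaling and parent selection, the sketch below assumes that fitness falls linearly from a maximum for the individual with the smallest error to a minimum for the individual with the largest error, with the maximum set to four times the minimum, and that parents are drawn with probability proportional to fitness. The function names and the handling of the degenerate case where all errors are equal are assumptions.

```python
import random


def fitness(errors, f_min=1.0, f_max=4.0):
    """Scale prediction errors linearly into fitnesses in [f_min, f_max].

    The smallest error receives f_max and the largest receives f_min.
    """
    e_min, e_max = min(errors), max(errors)
    if e_max == e_min:
        # Assumption: with identical errors every individual is equally fit.
        return [f_min for _ in errors]
    scale = (f_max - f_min) / (e_max - e_min)
    return [f_min + (e_max - e) * scale for e in errors]


def select_parents(population, fitnesses, k, rng=random):
    """Stochastic sampling with replacement: draw k parents, each chosen
    with probability proportional to its fitness."""
    return rng.choices(population, weights=fitnesses, k=k)


errors = [10.0, 20.0, 30.0]
print(fitness(errors))  # [4.0, 2.5, 1.0]
```

Bounding the ratio between the largest and smallest fitness keeps the best individual from dominating the next generation, whatever the spread of the raw errors.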

The mating or crossover process is accomplished by randomly selecting pairs 
of individuals to mate and replacing each pair by their children in the new pop- 
ulation. The crossover of two individuals proceeds in a slightly nonstandard way 
because our chromosomes are not fixed length but a multiple of the number 
of bits used to represent each template. Two children are produced from each 
crossover by randomly selecting a template $i$ and a position $p$ in the template 
from the first individual $T_1 = t_{1,1}, \ldots, t_{1,n}$ and randomly selecting a template 
$j$ in the second individual $T_2 = t_{2,1}, \ldots, t_{2,m}$ so that the resulting individuals 
will not have more than 10 templates. The new individuals are then 
$T_1' = t_{1,1}, \ldots, t_{1,i-1}, n_1, t_{2,j+1}, \ldots, t_{2,m}$ and 
$T_2' = t_{2,1}, \ldots, t_{2,j-1}, n_2, t_{1,i+1}, \ldots, t_{1,n}$. If 
there are $b$ bits used to represent each template, $n_1$ is the first $p$ bits of $t_{1,i}$ 
concatenated with the last $b - p$ bits of $t_{2,j}$, and $n_2$ is the first $p$ bits of $t_{2,j}$ 
concatenated with the last $b - p$ bits of $t_{1,i}$. 

In addition to using crossover to produce the individuals of the next gener- 
ation, we also use a process called elitism whereby the best individuals in each 
generation survive unmutated to the next generation. We use crossover to pro- 
duce all but two individuals for each new generation and use elitism to select 
the last two individuals for each new generation. The individuals resulting from 
the crossover process are mutated to help maintain a diversity in the population. 
Each bit representing the individuals is flipped with a probability of 0.01. 
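
A possible rendering of the variable-length crossover and the bit-flip mutation is sketched below. It assumes that an individual is a list of templates and that each template is a list of b bits; the function names and the choice to enumerate all valid (i, j) pairs before sampling are illustrative assumptions, not details taken from the text.

```python
import random

MAX_TEMPLATES = 10


def crossover(parent1, parent2, rng=random):
    """Cross two template sets of possibly different lengths.

    Each parent is a list of templates; each template is a list of b bits.
    Template i of parent1 and template j of parent2 are spliced at bit p,
    and the template tails of the two parents are exchanged.
    """
    n, m, b = len(parent1), len(parent2), len(parent1[0])
    # Child 1 ends up with i + (m - j) templates, child 2 with j + (n - i).
    valid = [(i, j) for i in range(n) for j in range(m)
             if i + m - j <= MAX_TEMPLATES and j + n - i <= MAX_TEMPLATES]
    i, j = rng.choice(valid)            # (0, 0) is always valid
    p = rng.randrange(b + 1)            # splice position inside the template
    n1 = parent1[i][:p] + parent2[j][p:]
    n2 = parent2[j][:p] + parent1[i][p:]
    child1 = parent1[:i] + [n1] + parent2[j + 1:]
    child2 = parent2[:j] + [n2] + parent1[i + 1:]
    return child1, child2


def mutate(individual, rate=0.01, rng=random):
    """Flip each bit independently with probability `rate`."""
    return [[bit ^ int(rng.random() < rate) for bit in template]
            for template in individual]
```

Elitism would then copy the two best individuals of the current generation into the next one without passing them through `mutate`.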



Run-Time Prediction Experiments We use run time predictions to pre- 
dict queue wait times and improve the performance of scheduling algorithms. 
Therefore, we need to determine what workloads to search over to find the best 
template sets to use. We have already described four sets of trace data that 
were recorded from supercomputers. Next, we describe the three scheduling al- 
gorithms we consider. 

We use the first-come first-served (FCFS), least-work-first (LWF), and backfill 
scheduling algorithms in this work. In the FCFS algorithm, applications are 
given resources in the order in which they arrive. The application at the head of 
the queue runs whenever enough nodes become free. The LWF algorithm also 
tries to execute applications in order, but the applications are ordered in increas- 
ing order using estimates of the amount of work (number of nodes multiplied by 
estimated wallclock execution time) the application will perform. 
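
The LWF ordering can be expressed in a few lines of Python. The job fields `nodes` and `est`, and the `predict` callback that returns an estimated run time, are hypothetical names used only for this sketch.

```python
def lwf_order(queue, predict):
    """Order waiting jobs by estimated work: nodes times predicted run time.

    `sorted` is stable, so jobs with equal work keep their arrival order.
    """
    return sorted(queue, key=lambda job: job["nodes"] * predict(job))


queue = [
    {"name": "a", "nodes": 64, "est": 10},   # work 640
    {"name": "b", "nodes": 4, "est": 100},   # work 400
    {"name": "c", "nodes": 8, "est": 20},    # work 160
]
ordered = lwf_order(queue, predict=lambda job: job["est"])
print([job["name"] for job in ordered])  # ['c', 'b', 'a']
```

Only the relative order of the work estimates matters here, which is one way to see why LWF tolerates moderate run-time prediction errors.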

The backfill algorithm is a variant of the FCFS algorithm. The difference 
is that the backfill algorithm allows an application to run before it would in 
FCFS order if it will not delay the execution of applications ahead of it in 
the queue (those that arrived before it). When the backfill algorithm tries to 
schedule applications, it examines every application in the queue, in order of 
arrival time. If an application can run (there are enough free nodes and running 
the application will not delay the starting times of applications ahead of it in the 
queue), it is started. If an application cannot run, nodes are “reserved” for it at 
the earliest possible time. This reservation is only to make sure that applications 
behind it in the queue do not delay it; the application may actually start before 
the reservation time. 
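
One scheduling pass of this kind of backfilling, in which every queued job that cannot start immediately receives a reservation, might look like the following sketch. It is a simplified illustration under several assumptions: run times are the predicted values, every predicted end time of a running job lies after `now`, no job requests more nodes than the machine has, and the tuple layouts and function names are invented for the example.

```python
def nodes_in_use(allocations, t):
    """Nodes occupied at time t by allocations given as (start, end, nodes)."""
    return sum(n for start, end, n in allocations if start <= t < end)


def backfill_pass(now, total_nodes, running, queue):
    """Assign each queued job the earliest start that delays no earlier job.

    running: list of (predicted_end, nodes) for executing jobs.
    queue:   list of (name, nodes, predicted_run_time) in arrival order.
    Returns {name: start_time}; jobs whose start equals `now` can be started.
    """
    allocations = [(now, end, nodes) for end, nodes in running]
    starts = {}
    for name, nodes, run_time in queue:
        # Node availability only improves when an allocation ends, so the
        # earliest feasible start is `now` or one of those end times.
        candidates = sorted({now} | {end for _, end, _ in allocations if end > now})
        for t in candidates:
            # Usage only rises when an allocation starts, so it is enough to
            # check t itself and every allocation start inside the interval.
            checkpoints = [t] + [s for s, _, _ in allocations if t < s < t + run_time]
            if all(nodes_in_use(allocations, c) + nodes <= total_nodes
                   for c in checkpoints):
                starts[name] = t
                allocations.append((t, t + run_time, nodes))  # reservation
                break
    return starts


plan = backfill_pass(now=0, total_nodes=10, running=[(10, 6)],
                     queue=[("A", 8, 5), ("B", 4, 10), ("C", 4, 5)])
print(plan)  # {'A': 10, 'B': 0, 'C': 15}
```

In the example, job B arrives after A but starts first because it fits into the free nodes without pushing back A's reservation at time 10.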

Each scheduling algorithm predicts application run times at different times 
when predicting queue wait times for the jobs in each trace. When predicting 
queue wait times, we predict the wait time of an application when it is submitted. 
A wait-time prediction in this case requires run-time predictions of all 
applications in the system so the run-time prediction workload contains predic- 
tions for all running and queued jobs every time an application is submitted. 
We insert data points for an application into our historical database as soon as 
each application completes. To try to find the optimal template set to use to 
predict execution times, we use a workload for each algorithm/trace pair and 
search over each of these 12 workloads separately. 

When using run-time predictions while scheduling, run-time predictions are 
also made at different times for each algorithm/trace pair, and we attempt to 
find the optimal template sets to use for each pair. The FCFS algorithm does 
not use run-time predictions when scheduling, so we only consider the LWF and 
backfill algorithms here. For the LWF algorithm, all waiting applications are 
predicted whenever the scheduling algorithm attempts to start an application 
(when any application is enqueued or finishes). This occurs because the LWF 
algorithm needs to find the waiting application that will use the least work. 
For the backfill algorithm, all running and waiting applications are predicted 
whenever the scheduling algorithm attempts to start an application (when any 
application is enqueued or finishes). 

We generate our run-time prediction workloads for scheduling using maxi- 
mum run times as run-time predictions. We note that using maximum run times 
will produce predictions and insertions slightly different from those produced 
when the LWF and backfill algorithms use other run-time predictions. Never- 
theless, we believe that these run-time prediction workloads are representative 
of the predictions and insertions that will be made when scheduling using other 
run-time predictors. 

2.2 Related Work 

Gibbons [8,9] also uses historical information to predict the run times of parallel 
applications. His technique differs from ours principally in that he uses a fixed 
set of templates and different characteristics to define templates. He uses the six 
template/predictor combinations listed in Table 3. The running time (rtime) 
characteristic indicates how long an application has been executing when a pre- 
diction is made for the application. Gibbons produces predictions by examining 
categories derived from the templates listed in Table 3, in the order listed, until 
a category that can provide a valid prediction is found. This prediction is then 
used as the run-time prediction. 

The set of templates listed in Table 3 results because Gibbons uses templates 
of (u,e), (e), and () with subtemplates in each template. The subtemplates 
add the characteristics n and rtime. Gibbons also uses the requested number 
of nodes slightly differently from the way we do: rather than having equal-sized 
ranges specified by a parameter, as we do, he defines the fixed set of exponential 
ranges 1, 2-3, 4-7, 8-15, and so on. 
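
For comparison with fixed-size node ranges, the exponential ranges 1, 2-3, 4-7, 8-15, and so on can be computed as in this small sketch (the function name is an assumption):

```python
def exponential_range(nodes: int) -> tuple:
    """Return (low, high) of the power-of-two range that contains `nodes`."""
    low = 1 << (nodes.bit_length() - 1)   # largest power of two <= nodes
    return (low, 2 * low - 1)


print([exponential_range(n) for n in (1, 3, 7, 8)])
# [(1, 1), (2, 3), (4, 7), (8, 15)]
```

With this scheme the ranges widen as jobs get larger, whereas a node range size parameter gives every range the same width.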

Table 3. Templates used by Gibbons for run-time prediction. 

Number   Template        Predictor
1        (u,e,n,rtime)   mean
2        (u,e)           linear regression
3        (e,n,rtime)     mean
4        (e)             linear regression
5        (n,rtime)       mean
6        ()              linear regression

Another difference between Gibbons’s technique and ours is how he performs 
a linear regression on the data in the categories (u,e), (e), and (). These 
categories are used only if one of their subcategories cannot provide a valid 
prediction. A weighted linear regression is performed on the mean number of 
nodes and the mean run time of each subcategory that contains data, with each 
pair weighted by the inverse of the variance of the run times in their subcategory. 
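
The weighted regression over subcategory means can be sketched with the closed-form weighted least-squares fit below. The function name and argument layout are assumptions, and the sketch presumes that every subcategory has a nonzero run-time variance and that the subcategories do not all share the same mean node count.

```python
def weighted_linear_fit(mean_nodes, mean_run_times, variances):
    """Fit run_time = slope * nodes + intercept by weighted least squares.

    Each (mean_nodes, mean_run_time) pair is weighted by the inverse of the
    run-time variance of its subcategory.
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, mean_nodes)) / total
    y_bar = sum(w * y for w, y in zip(weights, mean_run_times)) / total
    s_xx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, mean_nodes))
    s_xy = sum(w * (x - x_bar) * (y - y_bar)
               for w, x, y in zip(weights, mean_nodes, mean_run_times))
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


slope, intercept = weighted_linear_fit([1, 2, 4], [3, 5, 9], [1.0, 4.0, 2.0])
prediction_for_8_nodes = slope * 8 + intercept
```

Weighting by inverse variance lets subcategories with consistent run times pull the fitted line more strongly than noisy ones.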

Downey [3] uses a different technique to predict the execution time of parallel 
applications. His procedure is to categorize all applications in the workload, then 
model the cumulative distribution functions of the run times in each category, 
and finally use these functions to predict application run times. Downey catego- 
rizes applications using the queues that applications are submitted to, although 
he does state that other characteristics can be used in this categorization. In fact, 
Downey’s prediction technique within a category can be used with our technique 
for finding the best characteristics to use to categorize applications. 

Downey observed that the cumulative distributions of the execution times 
of the jobs in the workloads he examined can be modeled relatively accurately 
by using a logarithmic function: $\beta_0 + \beta_1 \ln t$. Once the distribution functions are 
calculated, he uses two different techniques to produce a run-time prediction. 
The first technique uses the median lifetime given that an application has 
executed for $a$ time units. If one assumes the logarithmic model for the cumulative 
distribution, this equation is 

$$\sqrt{a \, e^{(1.0 - \beta_0)/\beta_1}}.$$

The second technique uses the conditional average lifetime 

$$\frac{t_{\max} - a}{\log t_{\max} - \log a}$$

with $t_{\max} = e^{(1.0 - \beta_0)/\beta_1}$. 
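
Both of Downey's predictors follow directly from the logarithmic model, as the sketch below shows. It assumes the cumulative distribution is beta0 + beta1 ln t, so that the longest possible lifetime t_max is the point where the distribution reaches 1, and it requires 0 < a < t_max. The function names and the example coefficients are illustrative.

```python
import math


def t_max(beta0, beta1):
    """Lifetime at which the modeled cumulative distribution reaches 1."""
    return math.exp((1.0 - beta0) / beta1)


def conditional_median(a, beta0, beta1):
    """Median lifetime of a job that has already run for `a` time units."""
    return math.sqrt(a * t_max(beta0, beta1))


def conditional_average(a, beta0, beta1):
    """Mean lifetime of a job that has already run for `a` time units."""
    longest = t_max(beta0, beta1)
    return (longest - a) / (math.log(longest) - math.log(a))


# Example coefficients chosen so that lifetimes span 1 to 100 time units.
beta0, beta1 = 0.0, 1.0 / math.log(100.0)
median = conditional_median(1.0, beta0, beta1)     # about 10
average = conditional_average(1.0, beta0, beta1)   # about 21.5
```

The median is the geometric mean of the current age and t_max, so it is always smaller than the conditional average for the same job; this is one reason the two predictors behave differently in the tables that follow.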

3 Predicting Queue Wait Times 

We use the run-time predictions described in the preceding section to predict 
queue wait times. Our technique is to perform a scheduling simulation using the 
predicted run times as the run times of the applications. This will then provide 
predictions of when applications will start to execute. Specifically, we simulate 
the FCFS, LWF, and backfill scheduling algorithms and predict the wait time 
for each application as it is submitted to the scheduler. The accuracy of using 
various run-time predictors is shown in Table 4 through Table 9. 
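
For the simplest case, FCFS, the simulation that turns run-time predictions into wait-time predictions can be sketched as follows. The function name and tuple layouts are assumptions, and the sketch presumes that no job requests more nodes than the machine has.

```python
import heapq


def predict_fcfs_waits(now, total_nodes, running, queue):
    """Predict FCFS wait times by simulating with predicted run times.

    running: list of (predicted_end_time, nodes) for executing jobs.
    queue:   list of (nodes, predicted_run_time) in arrival order.
    Returns the predicted wait, measured from `now`, of each queued job.
    """
    free = total_nodes - sum(nodes for _, nodes in running)
    ends = list(running)
    heapq.heapify(ends)                  # earliest predicted completion first
    clock = now
    waits = []
    for nodes, run_time in queue:
        # Advance simulated time until the job at the head of the queue fits.
        while free < nodes:
            end, released = heapq.heappop(ends)
            clock = max(clock, end)
            free += released
        waits.append(clock - now)
        free -= nodes
        heapq.heappush(ends, (clock + run_time, nodes))
    return waits


print(predict_fcfs_waits(0, 10, [(10, 6), (5, 4)], [(8, 5), (2, 1)]))
# [10, 10]
```

The wait time of a newly submitted job is the last entry of the returned list when that job is appended to the end of `queue`.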

Table 4 shows the wait-time prediction performance when actual run times 
are used during prediction. No data is shown for the FCFS algorithm because 
there is no error when computing wait-time predictors in this case: later-arriving 
jobs do not affect the start times of the jobs that are currently in the queue. 
For the LWF and backfill scheduling algorithms, wait-time prediction error does 
occur because later arriving jobs can affect when the jobs currently in the queue 
can run. As one can see in the table, the wait-time prediction error for the LWF 
algorithm is between 34 and 43 percent. For the backfill scheduling algorithm, 
there is a smaller error of 3 to 4 percent. This error is higher for the LWF 
algorithm because later arriving jobs that wish to perform smaller amounts 
of work move to the head of the queue. Any error for the backfill algorithm 
seems unexpected at first, but errors in wait-time prediction can occur because 
scheduling is performed using maximum run times. For example, a job J2 arriving 
in the queue can start ahead of an already queued job J1 because the scheduler 
does not believe job J1 can use those nodes; a running job finishes early, and job 
J1 would start except that the job J2 is using nodes that it needs. This example 
results in a wait-time prediction error for job J1 before job J2 arrives in the 
queue. 

Table 4. Wait-time prediction performance using actual run times. 

Workload   Scheduling Algorithm   Mean Error (minutes)   Percentage of Mean Wait Time
ANL        LWF                    37.14                  43
ANL        Backfill               5.84                   3
CTC        LWF                    4.05                   39
CTC        Backfill               2.62                   10
SDSC95     LWF                    5.83                   39
SDSC95     Backfill               1.12                   4
SDSC96     LWF                    3.32                   42
SDSC96     Backfill               0.30                   3


Table 5 shows the wait-time prediction errors while using maximum run times 
as run-time predictions. Maximum run times are used to predict run times in 
scheduling systems such as EASY [11]. These predictions are provided in the 
ANL and CTC workloads and are implied in the SDSC workloads because each 
of the queues in the two SDSC workloads has maximum limits on resource usage. 
To derive maximum run times for the SDSC workloads, we determine the longest 
running job in each queue and use that as the maximum run time for all jobs in 
that queue. The wait-time prediction error when using actual run times as run- 
time predictors is 59 to 99 percent better than the wait-time prediction error of 
the LWF and backfill algorithms when using maximum run times as the run-time 
predictor. 
Table 5. Wait-time prediction performance using maximum run times. 

Workload   Scheduling Algorithm   Mean Error (minutes)   Percentage of Mean Wait Time
ANL        FCFS                   996.67                 186
ANL        LWF                    97.12                  112
ANL        Backfill               429.05                 242
CTC        FCFS                   125.36                 128
CTC        LWF                    9.86                   94
CTC        Backfill               51.16                  190
SDSC95     FCFS                   162.72                 295
SDSC95     LWF                    28.56                  191
SDSC95     Backfill               93.81                  333
SDSC96     FCFS                   47.83                  288
SDSC96     LWF                    14.19                  180
SDSC96     Backfill               39.66                  350


Table 6 shows that our run-time prediction technique results in wait-time 
prediction errors that are from 34 to 77 percent of mean wait times. Our data 
also show run-time prediction errors that are from 33 to 73 percent of 
mean application run times. The best wait-time prediction performance occurs 
for the ANL and CTC workloads and the worst for the SDSC96 workload. This 
is the opposite of what we expect from the run-time prediction errors. The 
most accurate run-time predictions are for the SDSC96 workload and the least 
accurate are for the CTC workload. The results imply that accurate run-time 
predictions are not the only factor that determines the accuracy of wait-time 
predictions. 

The results when using our run-time predictor also show that the mean wait 
time prediction error is 20 percent better to 62 percent worse than when pre- 
dicting wait times for the LWF algorithm using actual run times. Finally, using 
our run-time predictor results in 42 to 88 percent better wait time predictions 
than when using maximum run times as the run-time predictors. 

Table 7 shows the wait-time prediction errors when using Gibbons’s run-time 
predictor. Our run-time prediction errors are between 39 and 68 percent 
better than Gibbons’s and our wait-time prediction errors are between 13 and 
83 percent better. 

Tables 8 and 9 show the wait-time prediction error when using Downey’s 
conditional average and conditional median predictors. The wait-time prediction 
errors we achieve when using our run-time predictor are between 19 and 87 
percent better than these errors, and our run-time prediction errors are between 
42 and 92 percent better. 

In summary, the results show that our run-time predictor is more accurate 
than maximum run times, Gibbons’s predictor, or Downey’s predictors. The 
results also show that the wait-time prediction errors are smaller when our run-time 
predictor is used. Clearly, there is a correlation between wait-time prediction 
error and run-time prediction error. 
Table 6. Wait-time prediction performance using our run-time predictor. 

Workload   Scheduling Algorithm   Mean Error (minutes)   Percentage of Mean Wait Time
ANL        FCFS                   161.49                 30
ANL        LWF                    44.75                  51
ANL        Backfill               75.55                  43
CTC        FCFS                   30.84                  31
CTC        LWF                    5.74                   55
CTC        Backfill               11.37                  42
SDSC95     FCFS                   20.34                  37
SDSC95     LWF                    8.72                   58
SDSC95     Backfill               12.49                  44
SDSC96     FCFS                   9.74                   59
SDSC96     LWF                    4.66                   59
SDSC96     Backfill               5.03                   44


4 Improving Scheduler Performance 

Our second application of run-time predictions is to improve the performance 
of the LWF and backfill scheduling algorithms. Table 10 shows the performance 
of the scheduling algorithms when the actual run times are used as run-time 
predictors. This is the best performance we can expect in each case and serves 
as an upper bound on scheduling performance. 

Table 11 shows the performance of using maximum run times as run time 
predictions in terms of average utilization and mean wait time. The scheduling 
performance when using the maximum run times can once again be considered an 
upper bound for comparison. When comparing this data to the data in Table 10, 
one can see that the maximum run times are an inaccurate predictor but this 
fact does not affect the utilization of the simulated parallel computers. Using 
actual run times as predictions when scheduling results in 3 to 27 percent lower 
mean wait times, except in one case where using maximum run times results in 
6 percent lower mean wait times. The effect of accurate run-time predictions is 
highest for the ANL workload, which has the largest offered load. 

Table 12 shows the performance of using our run-time prediction technique 
when scheduling. The run-time prediction error in this case is between 23 and 93 
percent of mean run times, slightly worse than the results when predicting run 
times for wait-time prediction. This worse performance is due to more predictions 
being performed. First, more predictions are made of applications before they 
begin executing; and these predictions do not have information about how long 
an application has executed. Second, more predictions are made of long-running 
applications, the applications that contribute the largest errors to the mean 
errors. 

Our run-time prediction technique results in mean wait times that are 5 
percent better to 4 percent worse than when using actual run times as predictions 
for the least-work-first algorithm. For the backfill algorithm, mean wait times 
when using our run-time predictor are 11 to 22 percent worse. These results can 
be understood by noticing that the backfill algorithm requires more accurate 
run-time predictions than LWF. LWF just needs to know if applications are “big” or 
“small,” and small errors do not greatly affect performance. The performance of 
the backfill algorithm depends on accurate run-time predictions because it tries 
to fit applications into time/space slots. 

Table 7. Wait-time prediction performance using Gibbons’s run-time predictor. 

Workload   Scheduling Algorithm   Mean Error (minutes)   Percentage of Mean Wait Time
ANL        FCFS                   350.86                 66
ANL        LWF                    76.23                  91
ANL        Backfill               94.01                  53
CTC        FCFS                   81.45                  83
CTC        LWF                    32.34                  309
CTC        Backfill               13.57                  50
SDSC95     FCFS                   54.37                  99
SDSC95     LWF                    11.60                  78
SDSC95     Backfill               20.27                  72
SDSC96     FCFS                   22.36                  135
SDSC96     LWF                    6.88                   87
SDSC96     Backfill               17.31                  153

When comparing our run-time prediction technique to using maximum run 
times, our technique has a minimal effect on the utilization of the systems, but 
it does decrease the mean wait time in six of the eight experiments. Table 13 
through Table 15 show the performance of the scheduling algorithms when using 
Gibbons’s and Downey’s run-time predictors. The results indicate that once 
again, using our run-time predictor does not produce greater utilization. The 
results also show that our run-time predictor results in 13 to 50 percent lower 
mean wait times for the ANL workload, but there is no clearly better run-time 
predictor for the other three workloads. The ANL workload has much larger 
mean wait times and higher utilization (greater offered load) than the other 
workloads (particularly the SDSC workloads). This may indicate that the greater 
prediction accuracy of our technique matters mainly when scheduling becomes “hard.” To test 
this hypothesis, we compressed the interarrival time of applications by a factor of 
two for both SDSC workloads and then simulated these two new workloads. We 
found that our run-time predictor results in mean wait times that are 8 percent 
better on average, but that range from 43 percent lower to 31 percent higher than the mean wait 
times obtained when using Gibbons’s or Downey’s techniques. 

The results also show that of Gibbons’s and Downey’s run-time predictors, 
Downey’s conditional average is the worst predictor and Gibbons’s predictor is 
the most accurate. The data shows that our run-time predictor is between 2 and 
86 percent better than the other predictors, except for the CTC workload. For 
this workload, our predictor is the worst. This may be explained by the limited 
template searches we performed for that workload (because of time constraints). 
The accuracy of the run-time predictions for the CTC workload carries over to 
the mean wait times of the scheduling algorithms when using the various 
run-time predictors: our mean wait times are the worst. 

Table 8. Wait-time prediction performance using Downey’s conditional average 
run-time predictor. 

Workload   Scheduling Algorithm   Mean Error (minutes)   Percentage of Mean Wait Time
ANL        FCFS                   443.45                 83
ANL        LWF                    232.24                 277
ANL        Backfill               339.10                 191
CTC        FCFS                   65.22                  66
CTC        LWF                    14.78                  141
CTC        Backfill               17.22                  64
SDSC95     FCFS                   187.73                 340
SDSC95     LWF                    35.84                  240
SDSC95     Backfill               62.96                  223
SDSC96     FCFS                   83.62                  503
SDSC96     LWF                    28.42                  361
SDSC96     Backfill               47.11                  415



5 Conclusions 

In this work, we apply predictions of application run times to two separate 
scheduling problems. The problems are predicting how long applications will 
wait in queues before executing and improving the performance of scheduling 
algorithms. Our technique for predicting application run times is to derive a 
prediction for an application from the run times of previous applications judged 
similar by a template of key job characteristics. The novelty of our approach 
lies in the use of search techniques to find the best templates. For the workloads 
considered in this work, our searches found templates that result in run-time 
prediction errors that are significantly lower than those obtained with the
techniques of other researchers or with user-supplied maximum run times.
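
The template idea summarized above can be illustrated with a minimal sketch.
It is not the implementation used in this work: the job fields (user,
executable, nodes), the use of a plain mean as the prediction, and the function
names are assumptions made for illustration, and the template search is reduced
to scoring a handful of candidate templates exhaustively.

```python
from statistics import mean


def predict(history, job, template):
    """Mean run time of past jobs that match `job` on every field in `template`."""
    key = tuple(job[f] for f in template)
    similar = [h["run_time"] for h in history
               if tuple(h[f] for f in template) == key]
    return mean(similar) if similar else None


def template_error(history, template):
    """Mean absolute error when each job is predicted from the jobs before it."""
    errors = []
    for i, job in enumerate(history):
        p = predict(history[:i], job, template)
        if p is not None:
            errors.append(abs(p - job["run_time"]))
    return mean(errors) if errors else float("inf")


def best_template(history, candidates):
    """Stand-in for the search: pick the candidate with the lowest error."""
    return min(candidates, key=lambda t: template_error(history, t))


# Hypothetical trace; run times in minutes.
history = [
    {"user": "ann", "executable": "cfd", "nodes": 16, "run_time": 60},
    {"user": "bob", "executable": "cfd", "nodes": 16, "run_time": 200},
    {"user": "ann", "executable": "cfd", "nodes": 16, "run_time": 64},
    {"user": "bob", "executable": "cfd", "nodes": 16, "run_time": 210},
    {"user": "ann", "executable": "md", "nodes": 32, "run_time": 30},
]
candidates = [("executable",), ("user", "executable"), ("user",)]
chosen = best_template(history, candidates)
new_job = {"user": "ann", "executable": "cfd", "nodes": 16}
prediction = predict(history, new_job, chosen)
```

On this toy trace the template (user, executable) gives the lowest error,
because the two users run the same executable with very different run times.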

Table 9. Wait-time prediction performance using Downey’s conditional median
run-time predictor.

Workload  Scheduling  Mean Error  Percentage of
          Algorithm   (minutes)   Mean Wait Time
--------  ----------  ----------  --------------
ANL       FCFS          534.71         100
ANL       LWF           254.91         304
ANL       Backfill      410.57         232
CTC       FCFS           83.33          85
CTC       LWF            15.47         148
CTC       Backfill       19.35          72
SDSC95    FCFS           62.67         114
SDSC95    LWF            18.28         122
SDSC95    Backfill       27.52          98
SDSC96    FCFS           34.23         206
SDSC96    LWF            12.65         161
SDSC96    Backfill       20.70         183

Table 10. Scheduling performance using actual run times.

Workload  Scheduling  Utilization  Mean Wait Time
          Algorithm   (percent)    (minutes)
--------  ----------  -----------  --------------
ANL       LWF            70.34          61.20
ANL       Backfill       71.04         142.45
CTC       LWF            51.28          11.15
CTC       Backfill       51.28          23.75
SDSC95    LWF            41.14          14.48
SDSC95    Backfill       41.14          21.98
SDSC96    LWF            46.79           6.80
SDSC96    Backfill       46.79          10.42

We predict queue wait times by using run-time predictions and the algorithms
used by schedulers. These two factors are used to simulate scheduling
algorithms and decide when applications will execute. Estimates of queue wait
times are useful to guide resource selection when several systems are available,
to co-allocate resources from multiple systems, to schedule other activities, and
so forth. This technique results in a wait-time prediction error of between 33
and 73 percent of mean wait times when using our run-time predictors. This
error is significantly lower than when using the run-time predictors of Gibbons
or Downey, or user-supplied maximum run times. We also find that even if we
predict application run times with no error, the wait-time prediction error for
the least-work-first algorithm is significant (34 to 43 percent of mean wait
times).
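
The wait-time prediction technique summarized in these conclusions (simulate
the scheduler forward using predicted run times) can be sketched for the FCFS
case as follows. This is an illustrative sketch under stated assumptions, not
the simulator used in this work: the function name and data layout are
hypothetical, every job is assumed to fit on the machine, and LWF or backfill
would need their own queue-ordering logic.

```python
import heapq


def predict_fcfs_wait(total_nodes, running, queued, target):
    """Predicted wait (minutes from now) of queued[target] under FCFS.

    running: list of (nodes, predicted_remaining_minutes)
    queued:  list of (nodes, predicted_run_minutes), in arrival order
    Assumes no job asks for more than total_nodes.
    """
    free = total_nodes - sum(n for n, _ in running)
    # Min-heap of (predicted finish time, nodes released at that time).
    finishing = [(remaining, n) for n, remaining in running]
    heapq.heapify(finishing)
    now = 0
    for i, (nodes, run_time) in enumerate(queued):
        # FCFS: the job at the head of the queue blocks everything behind it.
        while free < nodes:
            now, released = heapq.heappop(finishing)
            free += released
        if i == target:
            return now
        free -= nodes
        heapq.heappush(finishing, (now + run_time, nodes))
    raise IndexError("target is not in the queue")


# Hypothetical 8-node machine with two running and three queued jobs.
running = [(4, 30), (4, 10)]
queued = [(4, 20), (8, 5), (2, 15)]
waits = [predict_fcfs_wait(8, running, queued, i) for i in range(len(queued))]
```

Any error in the predicted run times of the running and queued jobs propagates
directly into the predicted start time of the target job.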



We improve the performance of the least-work-first and backfill scheduling 
algorithms by using our run-time predictions when scheduling. We find that the 
utilization of the parallel computers we simulate does not vary greatly when 
using different run-time predictors. We also find that using our run-time pre- 
dictions does improve the mean wait times in general. In particular, our more 
accurate run-time predictors have the largest effect on mean wait time for the 
ANL workload, which has the highest utilization. In this workload, the mean 
wait times are between 7 and 67 percent lower when using our run-time predic- 
tions than when using other run-time predictions. We also find that on average, 
the mean wait time when using our predictor is within 8 percent of the mean 
wait time that would occur if the scheduler knows the exact run times of the ap- 
plications. The mean wait time when using our technique ranges from 5 percent 
better to 22 percent worse than when scheduling with the actual run times. 
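
The comparison with actual run times can be recomputed from Tables 10 and 12.
The text does not state how the percentages are formed; the sketch below
assumes the difference is expressed as a fraction of the mean wait time
obtained with the predicted run times (Table 12), which is the reading that
yields a range of roughly 5 percent better to 22 percent worse for the
tabulated values. The variable and function names are illustrative.

```python
# Mean wait times in minutes, keyed by (workload, algorithm).
actual = {  # Table 10: scheduling with actual run times
    ("ANL", "LWF"): 61.20, ("ANL", "Backfill"): 142.45,
    ("CTC", "LWF"): 11.15, ("CTC", "Backfill"): 23.75,
    ("SDSC95", "LWF"): 14.48, ("SDSC95", "Backfill"): 21.98,
    ("SDSC96", "LWF"): 6.80, ("SDSC96", "Backfill"): 10.42,
}
predicted = {  # Table 12: scheduling with our run-time predictions
    ("ANL", "LWF"): 78.22, ("ANL", "Backfill"): 148.77,
    ("CTC", "LWF"): 13.40, ("CTC", "Backfill"): 22.54,
    ("SDSC95", "LWF"): 16.19, ("SDSC95", "Backfill"): 22.17,
    ("SDSC96", "LWF"): 7.79, ("SDSC96", "Backfill"): 10.10,
}


def relative_difference(with_prediction, with_actual):
    """Percent by which the predicted-run-time wait exceeds the actual-run-time
    wait, relative to the former (assumed denominator)."""
    return 100.0 * (with_prediction - with_actual) / with_prediction


diffs = {k: relative_difference(predicted[k], actual[k]) for k in actual}
worst = max(diffs.values())    # positive: longer waits than with actual run times
best = min(diffs.values())     # negative: shorter waits than with actual run times
average = sum(diffs.values()) / len(diffs)
```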

In future work, we will investigate an alternative method for predicting queue
wait times. This method will use the current state of the scheduling system
(number of applications in each queue, time of day, etc.) and historical
information on queue wait times during similar past states to predict queue wait
times. We hope this technique will reduce wait-time prediction error,
particularly for the LWF algorithm, which has a large built-in error using the
technique presented here. Further, we will expand our work in using run-time
prediction techniques for scheduling to the problem of combining queue-based
scheduling and reservations. Reservations are one way to co-allocate resources
in metacomputing systems [1, 2, 6, 7]. Support for resource co-allocation is
crucial to large-scale applications that require resources from more than one
parallel computer.

Table 11. Scheduling performance using maximum run times.

Workload  Scheduling  Utilization  Mean Wait Time
          Algorithm   (percent)    (minutes)
--------  ----------  -----------  --------------
ANL       LWF            70.70          83.81
ANL       Backfill       71.04         177.14
CTC       LWF            51.28          10.48
CTC       Backfill       51.28          26.86
SDSC95    LWF            41.14          14.95
SDSC95    Backfill       41.14          28.20
SDSC96    LWF            46.79           7.88
SDSC96    Backfill       46.79          11.34

Table 12. Scheduling performance using our run-time prediction technique.

Workload  Scheduling  Utilization  Mean Wait Time
          Algorithm   (percent)    (minutes)
--------  ----------  -----------  --------------
ANL       LWF            70.28          78.22
ANL       Backfill       71.04         148.77
CTC       LWF            51.28          13.40
CTC       Backfill       51.28          22.54
SDSC95    LWF            41.14          16.19
SDSC95    Backfill       41.14          22.17
SDSC96    LWF            46.79           7.79
SDSC96    Backfill       46.79          10.10

Table 13. Scheduling performance using Gibbons’s run-time prediction technique.

Workload  Scheduling  Utilization  Mean Wait Time
          Algorithm   (percent)    (minutes)
--------  ----------  -----------  --------------
ANL       LWF            70.72          90.36
ANL       Backfill       71.04         181.38
CTC       LWF            51.28          11.04
CTC       Backfill       51.28          27.31
SDSC95    LWF            41.14          15.99
SDSC95    Backfill       41.14          24.83
SDSC96    LWF            46.79           7.51
SDSC96    Backfill       46.79          10.82

Table 14. Scheduling performance using Downey’s conditional average run-time
predictor.

Workload  Scheduling  Utilization  Mean Wait Time
          Algorithm   (percent)    (minutes)
--------  ----------  -----------  --------------
ANL       LWF            71.04         154.76
ANL       Backfill       70.88         246.40
CTC       LWF            51.28           9.87
CTC       Backfill       51.28          14.45
SDSC95    LWF            41.14          16.22
SDSC95    Backfill       41.14          20.37
SDSC96    LWF            46.79           7.88
SDSC96    Backfill       46.79           8.25

Table 15. Scheduling performance using Downey’s conditional median run-time
predictor.

Workload  Scheduling  Utilization  Mean Wait Time
          Algorithm   (percent)    (minutes)
--------  ----------  -----------  --------------
ANL       LWF            71.04         154.76
ANL       Backfill       71.04         207.17
CTC       LWF            51.28          11.54
CTC       Backfill       51.28          16.72
SDSC95    LWF            41.14          16.36
SDSC95    Backfill       41.14          19.56
SDSC96    LWF            46.79           7.80
SDSC96    Backfill       46.79           8.02






References

1. C. Catlett and L. Smarr. Metacomputing. Communications of the ACM,
35(6):44-52, 1992.

2. K. Czajkowski, I. Foster, N. Karonis, C. Kesselman, S. Martin, W. Smith, and
S. Tuecke. A Resource Management Architecture for Metasystems. Lecture Notes
in Computer Science, 1998.

3. Allen Downey. Predicting Queue Times on Space-Sharing Parallel Computers. In
International Parallel Processing Symposium, 1997.

4. N. R. Draper and H. Smith. Applied Regression Analysis, 2nd Edition. John Wiley
and Sons, 1981.

5. Dror Feitelson and Bill Nitzberg. Job Characteristics of a Production Parallel
Scientific Workload on the NASA Ames iPSC/860. Lecture Notes in Computer
Science, 949:337-360, 1995.

6. Ian Foster and Carl Kesselman. Globus: A Metacomputing Infrastructure Toolkit.
International Journal of Supercomputing Applications, 11(2):115-128, 1997.

7. Ian Foster and Carl Kesselman, editors. The Grid: Blueprint for a New Computing
Infrastructure. Morgan Kaufmann, 1999.

8. Richard Gibbons. A Historical Application Profiler for Use by Parallel Schedulers.
Lecture Notes in Computer Science, 1297:58-75, 1997.

9. Richard Gibbons. A Historical Profiler for Use by Parallel Schedulers. Master’s
thesis, University of Toronto, 1997.

10. David E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine
Learning. Addison-Wesley, 1989.

11. David A. Lifka. The ANL/IBM SP Scheduling System. Lecture Notes in Computer
Science, 949:295-303, 1995.

12. Warren Smith, Ian Foster, and Valerie Taylor. Predicting Application Run Times
Using Historical Information. Lecture Notes in Computer Science, 1459:122-142,
1998.

13. Neil Weiss and Matthew Hassett. Introductory Statistics. Addison-Wesley, 1982.



Deterministic Batch Scheduling without Static Partitioning

Kostadis Roussos¹, Nawaf Bitar², and Robert English²

¹ SGI, 2011 N. Shoreline Blvd, USA
kostadis@sgi.com
² Network Appliance, Santa Clara CA, USA
{nawaf, renglish}@netapp.com



Abstract. The Irix 6.5 scheduling system provides intrinsic support for 
batch processing, including support for guaranteed access to resources 
and policy-based static scheduling. Long range scheduling decisions are 
made by Miser, a user level daemon, which reserves the resources needed 
by a batch job to complete its tasks. Short-term decisions are made by the 
kernel in accordance with the reservations established by Miser. Unlike 
many batch systems, the processes in a batch job remain in the system 
from the time originally scheduled until the job completes. This gives 
the system considerable flexibility in choosing jobs to use idle resources. 
Unreserved or reserved but unused resources are available to either in- 
teractive jobs or batch jobs that haven’t yet been scheduled. The system 
thus gains the advantages of static partitioning and static scheduling, 
without their inherent costs. 



1 Introduction 

Supercomputers have two distinct resource management problems. The first is 
the allocation of resources between batch and interactive applications while guar- 
anteeing deterministic deadlines. In order to ensure deterministic run times, 
batch applications require that resources become available according to some 
fixed schedule and not be reclaimed until the application terminates. Interac- 
tive users expect that resources become available immediately, but can tolerate 
time-sharing of those resources. When both classes of users share a machine, the 
batch users experience poor performance since resources are being time-shared 
with interactive applications. To remedy this problem, supercomputer resources 
are typically partitioned statically between interactive and batch resource pools 
that cannot be shared, resulting in wasted idle resources. 

The second resource management problem is how to improve throughput
of systems that only run batch applications while still guaranteeing resource
availability and thus deterministic run times. Batch systems that guarantee
resource availability require that applications specify the resources they
require, and will only schedule an application on a supercomputer if there are
sufficient free resources. This approach results in idle resources, either
because the users overestimate the resources they require, or because there is
no batch application that is small enough to run on the subset of resources
available on the computer.

We present in this paper a new approach, consisting of a user-level resource
manager, Miser, and kernel-level support code, that addresses both resource
management problems, thus improving throughput and overall system performance.
Miser is responsible for generating a schedule of start and end times for batch
applications such that the resources are never over-committed. The kernel, on
the other hand, is responsible for ensuring that resources guaranteed by Miser
are made available according to the schedule specified by Miser, while
simultaneously making idle resources available to applications that are not yet
scheduled to run.

To support the CPU scheduling requirements of the Miser resource manager, 
namely good interactive behavior, non-interruptible CPU time to particular ap- 
plications, and dynamic partitioning of CPUs between batch and interactive 
applications, we implemented a new scheduler. To support the physical memory 
requirements, namely guaranteed physical memory for particular classes of ap- 
plications while simultaneously not wasting physical memory, we added a new 
accounting data structure to the virtual memory subsystem. The new account- 
ing data structure allows us to reserve memory for applications without actually 
allocating the memory. As a result, memory is guaranteed to be available, but 
remains unused until the application actually requires it, and therefore, available 
to other applications. 

It is the combination of scheduler and VM kernel level support that en- 
ables our system to deliver better throughput. Our approach enables system 
administrators to have deterministic run times for batch applications without 
the inherent waste in static partitioning, as well as improving throughput on 
dedicated batch systems. 

Our paper is divided into five sections. The first section will compare Miser 
with existing batch systems and describe some theoretical work. The second 
section will then describe Miser, showing how Miser manages supercomputer 
resources so as to generate a schedule of applications that does not over-commit 
the system. The second section will also describe the interaction between Miser 
and the kernel. The third section will describe the kernel modifications to the 
scheduler and VM required to support resource reservation and resource sharing. 
The fourth section will present some empirical evidence that Miser does indeed 
work. The paper will conclude with remarks on future directions for Miser and 
remarks on the work as a whole. 



2 Related Work 



Current commercial batch schedulers only provide static mechanisms for
managing the workload between batch and interactive jobs on systems using tools
such as PSET [PSET], cpusets [CPUSET], and the HP sub-complex manager [SCM].

Ashok and Zahorjan [AaZa92] considered a more dynamic mechanism of
allocating resources between batch and interactive users. They proposed the
partitioning of memory between interactive and batch applications and proposed
that CPUs be available on demand to interactive applications. Their approach
results in wasted memory when there is insufficient interactive or batch load to
utilize the reserved memory and, by allowing arbitrary preemption of CPUs to
favor interactive users, they cannot guarantee deterministic run times.

Other commercial batch schedulers [LSF][LL] are able to manage resource 
utilization between batch jobs so as to prevent over-subscription. However, they 
do not manage resource utilization of applications that are scheduled outside of 
the batch system. Consequently, on a machine that is shared by interactive and 
batch users, a batch system can not guarantee resource availability to batch jobs 
and thus deterministic run times, because it can not control resource usage of 
the interactive jobs. 

Commercial batch schedulers allow resources to go idle even if there is work
on the system, either because of holes in the schedule or because jobs are not
using all the resources they requested. A hole is a point in a schedule of batch
jobs where there are still free resources available, but no job small enough
to use them. A conventional batch system can not start a job that is
too big to fit in the hole, because it can not restrict a job to a subset of its
resources. Commercial batch schedulers can not exploit resources that are
committed but unused, because they have no mechanism for reclaiming them later
on if the application should require them.

3 Miser 

Miser is a resource manager that runs as a user-level daemon to support batch
scheduling. Miser, given a set of resources to manage, a policy to manage the
resources, and a job with specified space and time requirements, will generate a
start and end time for that job. Given a set of these jobs, Miser will generate a
schedule such that the resources are never oversubscribed. Miser will then pass
the start and end times to the kernel, which will manage the actual allocation
of physical resources to the applications according to the schedule defined by
Miser.



3.1 Resource Accounting 

Miser’s resource management was designed both to guarantee that resources
allocated to Miser are never oversubscribed, and to give system administrators
fine-grained control over those resources. The basic abstraction for resource
accounting is the resource pool. The resource pool keeps track of the
availability of resources over time. To guarantee that resources are never
oversubscribed, Miser deducts resources from the resource pool as it schedules
applications; if an application requests more resources than are available, it
is rejected. To enable fine-grained control over the resource pools, Miser uses
a two-tiered hierarchy of pools. At the top level is a single resource pool, the
system pool, that represents the total resources available to the system. Below
the system pool are one or more user pools whose aggregate resources are equal
to or less than the resources available to the system pool. A two-tiered
hierarchy of pools, whose size can vary over time, allows administrators to
control access to resources either by restricting access to user pools, or by
controlling how many resources each particular user pool has. For example, a
32 CPU system that needs to be shared equally between three departments for
batch use and the university for interactive use would have Miser configured to
manage 24 CPUs, leaving 8 CPUs for general purpose use. The Miser resources
would then in turn be divided into three user pools of 8 CPUs each that would
have their access restricted to users from their respective departments (see
figure 1).
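
The accounting rules described above can be sketched as follows. This is an
illustrative model, not Miser's actual data structure: the class and method
names are hypothetical, time is divided into fixed slots, only CPUs are
tracked, pool sizes are fixed rather than varying over time, and the third
department name is invented for the example.

```python
class ResourcePool:
    """CPUs available in each time slot of a fixed scheduling horizon."""

    def __init__(self, cpus, horizon):
        self.cpus = cpus
        self.free = [cpus] * horizon  # free[t]: CPUs not yet reserved in slot t

    def reserve(self, cpus, start, duration):
        """Deduct the request from the pool, or reject it and change nothing."""
        if start < 0 or start + duration > len(self.free):
            return False
        slots = range(start, start + duration)
        if any(self.free[t] < cpus for t in slots):
            return False
        for t in slots:
            self.free[t] -= cpus
        return True


class SystemPool:
    """Top tier: user pools may not add up to more than the system pool."""

    def __init__(self, cpus, horizon):
        self.cpus = cpus
        self.horizon = horizon
        self.user_pools = {}

    def add_user_pool(self, name, cpus):
        committed = sum(p.cpus for p in self.user_pools.values())
        if committed + cpus > self.cpus:
            raise ValueError("user pools would exceed the system pool")
        self.user_pools[name] = ResourcePool(cpus, self.horizon)
        return self.user_pools[name]


# The 32-CPU example from the text: Miser manages 24 CPUs, split three ways;
# the remaining 8 CPUs stay outside Miser for general purpose use.
system = SystemPool(cpus=24, horizon=24)
math_pool = system.add_user_pool("math", 8)
physics_pool = system.add_user_pool("physics", 8)
chemistry_pool = system.add_user_pool("chemistry", 8)
```

A request that does not fit in every slot of its window is rejected without
side effects, which is what keeps a pool from ever being oversubscribed.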

3.2 Job Scheduling 

Job scheduling by Miser is similar to job scheduling by other batch schedulers; 
applications are submitted to Miser and Miser determines when they can run. 
Unlike conventional batch schedulers, applications that are scheduled to run are 
not waiting to be started on the host system, but are in a suspended state inside 
of IRIX waiting for the kernel to actually allocate them real physical resources 
so that they can run. As a result, the kernel can start the application before 
its scheduled start time by giving it any currently idle resources. Furthermore, 
because it is the kernel that is allocating actual physical resources, and not a 
user level process, the kernel is also able to reclaim any idle resources if it needs 
them to run another batch process or interactive workload. 

To schedule a job, users submit jobs to Miser using the miser_submit
command. Miser uses a user-defined policy that has an associated user resource
pool to schedule jobs. The policy and the resource pool are collectively
referred to as a queue (see figure 1). The miser_submit command requires the
user to specify the queue to which the job will be submitted, a resource
request, and the job invocation. The resource request is a tuple of time and
space: the time is the total CPU wall-clock time and the space is the logical
number of CPUs and physical memory required. Upon submission, Miser schedules
the job using the queue’s associated scheduling policy and resources and returns
a guaranteed start and completion time for the job if there are sufficient
resources; otherwise the job is rejected.
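
As a sketch of what a queue's policy might do with such a request, the code
below assumes a simple first-fit policy over fixed time slots. Policies in
Miser are user-defined, so first-fit is only one possible choice; the function
names are hypothetical and memory is left out for brevity.

```python
def first_fit(free, cpus, duration):
    """Earliest start slot with `cpus` free for `duration` consecutive slots."""
    for start in range(len(free) - duration + 1):
        if all(f >= cpus for f in free[start:start + duration]):
            return start
    return None


def submit(free, cpus, duration):
    """Commit the request and return (start, end), or None if it is rejected."""
    start = first_fit(free, cpus, duration)
    if start is None:
        return None
    for t in range(start, start + duration):
        free[t] -= cpus
    return (start, start + duration)


# Hypothetical queue with 8 CPUs and a horizon of six one-hour slots.
free = [8] * 6
first = submit(free, cpus=6, duration=3)    # takes slots 0-2
second = submit(free, cpus=4, duration=2)   # does not fit beside `first`
third = submit(free, cpus=2, duration=4)    # fills the gap left at slot 0
fourth = submit(free, cpus=8, duration=2)   # no window is large enough
```

The returned pair plays the role of the guaranteed start and completion time;
a return value of None corresponds to the job being rejected.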

Once Miser successfully schedules a job, and miser_submit starts it, the job
waits in the kernel for resources to be made available so it can run. Prior to the
job’s start time, it is in the batch state. A job in the batch state may run
opportunistically; that is, it may run on the otherwise idle resources of other
pools. Idle batch queue resources are first made available to interactive users
and then to batch queues; idle interactive resources are made available to the
batch queues. When a job’s start time arrives, the kernel transitions the job to
the batch-critical state and provides it with the resources it requested from
Miser, allowing it to run.

Fig. 1. An example Miser configuration on a 32 CPU system. Miser is configured
with 24 CPUs in its system pool and there are three queues with user pools of
eight CPUs each. The remaining eight CPUs are being used by the general
system.

Figure 2 shows an example of how Miser schedules a job. The user, in this
example, is submitting to the math queue from figure 1 a program called
TupleCount that requires 4 CPUs, 10 hours of wall-clock time and 100 MB of
memory. The user specifies the resource requirements and the program to be
scheduled by Miser using miser_submit. Miser, upon receipt of the job scheduling
request, first tries to find the math queue, verifies that the user has
sufficient privileges to schedule the job, and then asks the associated policy
of the queue to schedule the request. The policy uses the resource pool
associated with the queue to find a hole large enough to fit the job. Once the
policy has found the hole, the resources are committed and the start and end
times of the job are passed both to the kernel and to the miser_submit command,
which then starts TupleCount. TupleCount, after being successfully scheduled by
Miser, now waits in the kernel for resources to be made available to it.

Fig. 2. Scheduling of a job by Miser

While TupleCount is waiting in the kernel for its start time to arrive, the
kernel may let it run opportunistically on idle resources. These idle resources
can either be part of the system resources that are unused by the interactive
portion of the machine, or batch resources that are unused by other queues.
So for example, if the physics queue is idle and there is no other interactive
load while TupleCount is waiting in the kernel, the kernel will let it allocate
sufficient resources to begin running. If TupleCount, however, tries to use more
resources than are idle, then the kernel will suspend it. Similarly, any resources
that TupleCount uses can be reclaimed while the job is in the batch state to
allow a physics or interactive job to run. When TupleCount’s start time does
arrive, however, the kernel transitions the job to the batch-critical state, wakes
it up if necessary and allows it to allocate resources up to the total requested.
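
The two job states and the resources a job may hold in each can be summarized
in a small sketch. The names and the reduction of "resources" to a CPU count
are assumptions made for illustration; the kernel's actual bookkeeping is not
described at this level of detail in the text.

```python
from enum import Enum


class JobState(Enum):
    BATCH = "batch"                    # before the scheduled start time
    BATCH_CRITICAL = "batch-critical"  # from the scheduled start time onward


def job_state(now, start_time):
    return JobState.BATCH_CRITICAL if now >= start_time else JobState.BATCH


def cpus_granted(now, start_time, reserved, wanted, idle):
    """CPUs the job may hold right now; 0 means the job is suspended.

    reserved: CPUs guaranteed by the schedule
    wanted:   CPUs the job is currently trying to use
    idle:     CPUs not needed by any other pool or by interactive work
    """
    if job_state(now, start_time) is JobState.BATCH_CRITICAL:
        # Guaranteed resources, up to the total requested at submission.
        return min(wanted, reserved)
    # Opportunistic execution: only idle CPUs, and the job is suspended as
    # soon as it tries to use more than are idle.
    return wanted if wanted <= idle else 0


# A TupleCount-like job: 4 CPUs reserved, scheduled to start at hour 10.
early_idle = cpus_granted(now=0, start_time=10, reserved=4, wanted=4, idle=8)
early_busy = cpus_granted(now=0, start_time=10, reserved=4, wanted=4, idle=2)
on_time = cpus_granted(now=10, start_time=10, reserved=4, wanted=4, idle=0)
```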



3.3 Kernel - Miser Interaction 

The Miser-kernel interaction is limited to Miser providing sufficient information 
to the kernel such that applications are started and terminated by the kernel 
according to the schedule determined by Miser with the resources reserved by 
Miser. As each job is submitted to Miser, its parameters are passed to the kernel. 
After the job parameters have been passed to the kernel, the daemon does not 
intervene until the job has terminated. The parameters passed to the kernel do 
not include the queue information since the queue is a user-level abstraction. If 
the job terminates early, the kernel notifies Miser, so it can take any action it
wants at that time.


4 Kernel Support

While it is the responsibility of the Miser daemon to generate a schedule of jobs
that does not over-commit the resources of the machine, it is the responsibility
of the kernel to manage system resources such that applications receive the
resources requested and thus meet deadlines. Miser currently manages CPU and
memory resources. The kernel support consists of a new batch scheduling policy
for the IRIX scheduler and modifications to the virtual memory system. The
kernel scheduler and batch scheduling policy are responsible for ensuring that
jobs transition from the batch to the batch-critical state according to the
user-supplied schedule, that batch-critical applications get the CPUs they
requested, and that idle CPUs are made available to either batch or interactive
applications. The virtual memory subsystem is responsible for providing physical
memory to batch-critical applications and for ensuring that idle physical memory
is available to jobs in the batch state and also to interactive applications.

4.1 The Scheduler 

To support the CPU scheduling requirements of the Miser resource manager,
namely good interactive behavior, non-interruptible CPU time to particular
applications, and dynamic partitioning of CPUs between batch and interactive
applications, we implemented a new scheduler. It consists of a fully
preemptive, priority-based scheduler that has a number of simple abstractions:
kernel threads, CPU run queues, a unified priority range, and a batch and an
interactive scheduling policy. The operation of the scheduler is conceptually
simple: whenever a CPU needs to make a scheduling decision it looks for work on
its local run queue, the run queues of other CPUs, and any queues maintained by
the scheduling policies. A scheduling decision is required whenever a
higher-priority thread becomes runnable, or the current thread has run to the
end of its time slice or yielded the CPU. At a scheduling decision the processor
examines the local run queue for any work and, for purposes of load balancing,
the local queue of another randomly selected CPU for a thread with a higher
priority that can be stolen from that remote run queue. If there is no work on
either the processor’s local run queue or on any other local run queue, then the
processor will look at the run queues maintained by the various scheduling
policies. The interactive and batch scheduling policies can affect the
scheduling decisions of the scheduler by modifying the placement on run queues
and the priority of threads. The scheduler architecture enables both guaranteed
CPU time and dynamic partitioning of CPU resources. CPU time is guaranteed by
boosting the priority of any thread of a process so that it cannot be preempted
during its execution. Dynamic partitioning is possible because the scheduler is
fully preemptive. A CPU can run any thread when it is idle, because if a
higher-priority thread suddenly becomes runnable it will immediately preempt a
lower-priority thread.
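
The order in which a CPU looks for work can be sketched as below. This is a
simplified model rather than the IRIX code: threads are (priority, name)
tuples with larger numbers meaning higher priority, only one randomly chosen
remote queue is examined per decision, and all names are hypothetical.

```python
import random


def pick_thread(cpu, run_queues, policy_queues, rng=random):
    """Return the next (priority, name) thread for `cpu`, or None if idle."""
    local = run_queues[cpu]
    best_local = max(local, default=None)

    # Load balancing: look at one randomly selected other CPU and steal its
    # best thread if it outranks everything on the local queue.
    others = [c for c in run_queues if c != cpu]
    if others:
        victim = run_queues[rng.choice(others)]
        best_remote = max(victim, default=None)
        if best_remote is not None and (
                best_local is None or best_remote[0] > best_local[0]):
            victim.remove(best_remote)
            return best_remote

    if best_local is not None:
        local.remove(best_local)
        return best_local

    # Nothing on the examined run queues: fall back to the policy queues.
    for queue in policy_queues:
        if queue:
            return queue.pop(0)
    return None


# Two CPUs, so the "random" victim is always CPU 1 in this example.
run_queues = {0: [(5, "shell")], 1: [(9, "batch-critical")]}
policy_queues = [[(1, "opportunistic batch")]]
picked = [pick_thread(0, run_queues, policy_queues) for _ in range(4)]
```

Because a boosted thread always outranks time-share threads, it is picked as
soon as it is seen, which is the mechanism behind guaranteed CPU time.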

The batch scheduling policy has two goals. The first is to ensure that Miser
jobs receive the requested CPU time. The second goal is to start jobs early on
idle CPUs. To ensure that an application receives its requested CPU time, the
scheduler must accomplish two tasks. The first is to guarantee that an
application runs for the total requested time without interruption. This is
accomplished by raising the priority of the Miser job when it becomes
batch-critical to a value higher than that of any interactive job. To guarantee
that no two jobs in the batch-critical state time-slice with one another, the
CPUs are partitioned between distinct jobs, and the threads of a particular job
only run on the job’s partition. The second task is to ensure that an
application does not use any more CPU resources than requested, either by
running on more CPUs than requested or by over-running its time, because that
would prevent other applications from receiving their allotted time. This is
accomplished by restricting the set of CPUs that an application can run on while
it is batch-critical and by terminating the application should it over-run its
time. The second goal of the batch scheduler is accomplished by maintaining a
private queue of jobs in the batch state and using it to generate work for idle
CPUs. These threads run on CPUs until they are preempted by time-share threads
or batch-critical threads.

Figure 3 illustrates how dynamic partitioning takes place between batch and
interactive threads. Initially, all the threads of the batch job are suspended and
the only active threads are interactive threads (figure 3.a). When the first batch
thread becomes active because the job has become batch-critical, it is placed on
the run queue of CPU 4 (figure 3.b). At the next scheduling decision, the CPU
selects the batch thread to run since it has a higher priority and the interactive
thread ends up on another run queue as a result of load balancing (figure 3.c).
Later, the second batch thread becomes active, and the interactive thread on
CPU 1 is preempted so that the batch thread can run (figure 3.d). As a result
of these preemptions the CPUs are now partitioned evenly between batch and
interactive jobs (figure 3.e). At some point in the future, the batch job exits,
and CPUs 1 and 4 go idle (figure 3.f). At that point the CPUs go looking for
work, and begin to run the interactive threads that are queued on CPUs 2 and 3
(figure 3.g).

4.2 Virtual Memory Subsystem 

To support the notion of guaranteed physical memory to a particular class of
applications related by scheduling discipline without wasting unused physical
memory, a new accounting mechanism was added to the virtual memory subsystem.
The VM did not require major changes because it was already capable of
guaranteeing physical memory to particular applications. What it was not
capable of doing was accounting for memory across a set of applications related
only by scheduling discipline. The new accounting data structure used is called
a memory pool. There are currently only two pools, a Miser pool and a general
pool. The Miser pool is used to keep track of the total amount of memory
available for use by applications in the batch-critical state. The general pool
is used to keep track of the total amount of physical memory and swap, also
called logical swap, available to the rest of the system. By modifying the size
of the general pool it is possible to reserve enough memory for the Miser pool.
Since no actual physical memory is reserved, any job using the general pool can
use physical memory equal in size to the logical swap.

Fig. 3. Example of Dynamic Partitioning of CPUs. As the number of batch
threads changes so does the number of the CPUs in the batch partition. The
changes take effect after every scheduling decision.

Batch applications, when running opportunistically, use the global pool and
then transition to the Miser pool. To prevent batch jobs from failing a system
call or memory allocation because of insufficient memory in the global pool, the
job is suspended until it becomes batch-critical and the operation is restarted.
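
The pool accounting can be sketched with a small model. It is a sketch under
stated assumptions, not the kernel data structure: the class and method names
are hypothetical, pools only count reservations (no physical pages are
modeled), and the sizes are the 100 MB pools used in the figure 4 example.

```python
class MemoryPool:
    """Accounting only: tracks reservations, never allocates physical pages."""

    def __init__(self, size_mb):
        self.size = size_mb
        self.reserved = 0

    @property
    def free(self):
        return self.size - self.reserved

    def try_reserve(self, mb):
        """Record the reservation, or return False so the caller can suspend."""
        if mb > self.free:
            return False
        self.reserved += mb
        return True

    def release(self, mb):
        self.reserved -= mb


global_pool = MemoryPool(100)
miser_pool = MemoryPool(100)

# An interactive job reserves 50 MB from the global pool.
interactive_ok = global_pool.try_reserve(50)

# A batch job that requested 100 MB starts opportunistically: it gets the
# remaining 50 MB of the global pool, and its next request fails, so the job
# is suspended rather than having the allocation fail.
batch_first_half = global_pool.try_reserve(50)
batch_second_half = global_pool.try_reserve(50)

# At its start time the job becomes batch-critical: it claims its full request
# from the Miser pool and gives back what it held in the global pool.
batch_critical_ok = miser_pool.try_reserve(100)
global_pool.release(50)
```

After the transition the global pool has 50 MB free again and the Miser pool
has none, which matches the accounting in the worked example of figure 4.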

In figure 4 we show an example of how both the memory accounting and
memory usage vary on a batch system where there is a batch and an interactive
job running. The boxes represent each of the resources: the Miser and global
memory pools as well as the physical and swap memory. The shaded regions of
the boxes represent how much of a particular resource a particular job has at
any point in time. The shading color indicates the job type.



Fig. 4. VM state transition diagram 



The machine in figure 4 is initially configured with 100 MB of memory in the 
global pool and 100 MB in the Miser pool. The system also has 100 MB of swap 
space and 100 MB of physical memory (figure 4.a). The interactive job is the 
first job to start and requires 50 MB of memory, so it is allocated 50 MB from 
the global pool (figure 4.b). Since there is no other workload on the system, 
the interactive program is able to acquire 50 MB of physical memory. Later a 
batch job is started that has requested 100 MB of memory (figure 4.c). There 
are idle resources on the system, so the batch application begins to run 
opportunistically, using up the remaining 50 MB of the global pool. Since there 
is no other work on the system, the batch job is also able to use the remaining 
50 MB of physical memory, and now all physical memory is being used by 
applications. The batch application had requested 100 MB, but there is no more 
memory in the global pool, so it is suspended. At its start time, the batch job 
is transitioned to the batch-critical state; it now claims 100 MB from the 
Miser pool and releases the 50 MB from the global pool (figure 4.d). Now the 
global pool has 50 MB free, and the Miser pool has none. The batch application, 
since it has preference over physical memory, forces the interactive 
application to be swapped out because there is no other free physical memory. 
Physical memory now holds 100 MB for the batch job, and swap holds 50 MB for 
the interactive job. When the batch job finally terminates, the Miser pool 
grows back to 100 MB, and the freed physical memory is used once again by the 
interactive program (figure 4.e). At this point there is 50 MB of physical 
memory free and 50 MB free in the global pool. 
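
The accounting in this walk-through can be replayed with a few lines of 
Python. The sketch is only a bookkeeping model of the five states described 
above; the function name trace_figure4 and the layout of the snapshots are 
hypothetical, and all figures are in MB. 

```python
def trace_figure4():
    """Replay states (a)-(e) of the example; returns one snapshot per state."""
    phys_total = 100
    global_free, miser_free = 100, 100
    phys = {}   # job name -> MB of physical memory held
    swap = {}   # job name -> MB swapped out
    snapshots = {}

    def snap(label):
        snapshots[label] = {
            "global_free": global_free,
            "miser_free": miser_free,
            "phys_free": phys_total - sum(phys.values()),
            "swap_used": sum(swap.values()),
        }

    snap("a")                      # initial configuration

    global_free -= 50              # (b) interactive job charged to global pool
    phys["interactive"] = 50
    snap("b")

    global_free -= 50              # (c) batch job runs opportunistically
    phys["batch"] = 50
    snap("c")

    miser_free -= 100              # (d) batch-critical: claim the Miser pool,
    global_free += 50              #     release the global-pool charge,
    swap["interactive"] = phys.pop("interactive")   # push interactive to swap
    phys["batch"] = 100
    snap("d")

    miser_free += 100              # (e) batch job terminates
    del phys["batch"]
    phys["interactive"] = swap.pop("interactive")   # interactive swaps back in
    snap("e")
    return snapshots


for label, state in trace_figure4().items():
    print(label, state)
```

In state (d) the model shows 50 MB free in the global pool, nothing free in 
the Miser pool, no free physical memory, and 50 MB of swap in use, which agrees 
with the description in the text. 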

5 Empirical Evidence 

The goals of these experiments are to demonstrate that using Miser carries no 
performance cost and that the overall throughput of the system improves as a 
result. The scheduling policy used by the Miser queue was first fit, the 
default. 

Our experiments were conducted on a 16-processor Origin 2000 using cg, 
bt, and ep from the NAS benchmarks^. We ran the benchmarks repeatedly 
and reported either individual performance numbers or the number of times a 
benchmark was able to complete over a period of time. The load was generated 
using a CPU cycle burner, and the load number represents the number of cycle 
burners used. 

5.1 Experiments 

The first experiment shows the performance of cg, bt, and ep with different 
amounts of load (figure 5). We ran each benchmark under varying degrees of 
load, both with and without Miser. When the load was 0, the performance of the 
application with and without Miser was identical, indicating that there is no 
performance penalty for using Miser. As the load increases, applications not 
scheduled by Miser see their performance degrade, whereas applications 
scheduled by Miser do not. This demonstrates that Miser is able to guarantee 
deterministic run times for applications even with large interactive load. 

^ The performance of the benchmarks cannot be considered official SGI benchmark 
values. 

Fig. 5. First Experiment 

The next two experiments show that Miser can share idle resources between 
the batch and interactive portions of the machine. To quantify this, we 
measured the throughput of the machine by simultaneously starting five copies 
of each benchmark and recording both the total wall-clock time for all five 
copies to run and the average latency. The benchmarks in all cases were run 
using eight threads. Miser was configured to manage half of the total system 
resources. 
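
The two quantities recorded here can be computed as in the following Python 
sketch. It assumes that "latency" means the time from start to finish of one 
copy and that the total wall-clock time runs from the first start to the last 
finish; both readings, the function name summarize, and the sample timings are 
assumptions for illustration, not data from the experiments. 

```python
def summarize(runs):
    """runs: list of (start, finish) times in seconds, one pair per copy."""
    if not runs:
        raise ValueError("need at least one run")
    first_start = min(start for start, _ in runs)
    last_finish = max(finish for _, finish in runs)
    total_wall_clock = last_finish - first_start
    average_latency = sum(finish - start for start, finish in runs) / len(runs)
    return total_wall_clock, average_latency


# Made-up timings for five copies started together at t = 0 (not measured data).
example = [(0, 40), (0, 42), (0, 45), (0, 50), (0, 48)]
total, average = summarize(example)
print(total, average)
```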

The second experiment (see figure 6) measured the latency of applications on 
a 16 CPU machine with Miser configured but not in use, and compared it to the 
performance of the same benchmarks on an 8 CPU machine. For each benchmark, the 
latency improved dramatically compared to the smaller machine. The results show 
that resources reserved by Miser can be used by interactive applications. A 16 
CPU machine configured to use Miser will give better average latencies for 
interactive applications than two separate 8 CPU machines, one reserved for 
batch and one for interactive users, whenever the batch portion of the machine 
is idle. 

Fig. 6. Second Experiment 

The third experiment (see figure 7) examines the difference in overall 
throughput in a batch-scheduling environment. The total time taken for the jobs 
to complete their runs was measured in three configurations: when scheduled by 
a simulated batch scheduler, when scheduled by Miser with the jobs not taking 
advantage of idle resources, and when scheduled by Miser with the jobs taking 
advantage of idle resources. To simulate a batch environment, applications were 
started sequentially until the total resources of the machine were consumed. On 
the 16 CPU system, two 8-way threaded copies of each application ran at the 
same time. First, note that batch applications under Miser perform better when 
they use idle resources than when they do not. This shows that Miser can indeed 
take advantage of idle resources to improve total throughput. Second, 
applications under Miser perform better than under the simulated batch 
scheduler when the applications scheduled by Miser can use the idle resources. 

Jobs scheduled by Miser perform better than those under the simulated batch 
scheduler because of the way Miser schedules applications. The threads of an 
application are always run together, the threads are never preempted, and the 
threads of distinct applications do not interfere with each other. This results 
in two benefits. The first is that applications make better use of the memory 
system. The second is that application threads are more efficiently 
co-scheduled. Note, however, that the improvement depends on how sensitive the 
applications are to co-scheduling. The sensitivity depends on how frequently 
the application performs busy-wait synchronization. ep, which does no busy-wait 
synchronization, showed no improvement; cg, which does the most, showed the 
most; bt fell in between. 

Fig. 7. Third Experiment 

6 Future Work 

Miser does not fully exploit the NUMA properties of the underlying 
architecture. Although Miser is able to reserve total memory well, achieving 
optimal performance on NUMA systems will require Miser to reserve memory on 
specific nodes and to reserve particular topologies. In this case Miser would 
have a fundamental advantage over normal kernel mechanisms because Miser 
jobs have known run times. 

Miser currently requires that applications not use more than the available 
physical memory. If the application requires more, Miser cannot schedule the 
application. A better solution would be to allow batch jobs to self-page. In 
this model, the batch application still has a reservation of memory, but it 
also has a reservation of swap. It can thus swap out portions of its physical 
memory to disk, so as to have a working size that is potentially much larger 
than the physical memory size. 

Miser was originally envisioned as a general-purpose resource management 
facility for scheduling applications that require particular physical resources 
to run. We hope to extend Miser to other classes of applications, such as 
real-time, and to manage more resources, such as disk. The problem with 
real-time on IRIX is that configuring a real-time application requires a 
multi-step process that is error prone. Using Miser, it would be possible to 
configure a particular queue that had the desired real-time attributes, and to 
start a real-time application by simply submitting the application to Miser. 
Miser would then take the necessary kernel-level actions to guarantee the 
requirements of the real-time application rather than leave it to the 
application writer. This approach is not only less error prone, it also allows 
a real-time system to be shared by different users in a way that prevents them 
from interfering with each other. 

Finally, although there is support for user-defined policies, the mechanisms 
have not yet been fully defined. We hope to define and export interfaces that 
would allow users to provide scheduling policies. 

7 Conclusions 

Theoretical results have shown that dynamic partitioning of CPUs and the static 
partitioning of memory provide the best batch throughput and interactive 
response time. These theoretical results must be balanced against the real 
requirement for deterministic batch scheduling that has forced system 
administrators to statically partition memory and CPU time. Our system departs 
from the norm of user-level schedulers by providing kernel support. Using the 
underlying kernel scheduler we are able to guarantee CPU time and memory to 
batch jobs and are thus able to guarantee deadlines for particular 
applications. Furthermore, because we have no static scheduling, we are able to 
schedule jobs on CPUs that other batch schedulers must leave idle in order to 
achieve guaranteed performance, and thus achieve better throughput, as we have 
demonstrated. Miser is a new mechanism for scheduling batch jobs with 
deterministic deadlines without the inherent waste of resources that results 
from static partitioning. 